Preface



The origins of this book lie in our earlier book Random Processes: 
A Mathematical Approach for Engineers , Prentice Hall, 1986. This 
book began as a second edition to the earlier book and the basic 
goal remains unchanged — to introduce the fundamental ideas and 
mechanics of random processes to engineers in a way that accurately 
reflects the underlying mathematics, but does not require an exten- 
sive mathematical background and does not belabor detailed general 
proofs when simple cases suffice to get the basic ideas across. In the 
years since the original book was published, however, it has evolved 
into something bearing little resemblance to its ancestor. Numerous
improvements in the presentation of the material have been sug- 
gested by colleagues, students, teaching assistants, reviewers, and by 
our own teaching experience. The emphasis of the book shifted in- 
creasingly towards examples and a viewpoint that better reflected the 
title of the courses we taught for many years at Stanford University 
and at the University of Maryland using the book: An Introduction 
to Statistical Signal Processing. Much of the basic content of this 
course and of the fundamentals of random processes can be viewed 
as the analysis of statistical signal processing systems: typically one 
is given a probabilistic description for one random object, which can 
be considered as an input signal. An operation is applied to the in-
put signal (signal processing) to produce a new random object, the
output signal. Fundamental issues include the nature of the basic
probabilistic description and the derivation of the probabilistic de-
scription of the output signal given that of the input signal and a
description of the particular operation performed. A perusal of the 
literature in statistical signal processing, communications, control, 
image and video processing, speech and audio processing, medical 
signal processing, geophysical signal processing, and classical statis- 
tical areas of time series analysis, classification and regression, and 
pattern recognition shows a wide variety of probabilistic models for
input processes and for operations on those processes, where the op- 
erations might be deterministic or random, natural or artificial, linear 
or nonlinear, digital or analog, or beneficial or harmful. An introduc- 
tory course focuses on the fundamentals underlying the analysis of 
such systems: the theories of probability, random processes, systems, 
and signal processing. 

When the original book went out of print, the time seemed ripe 
to convert the manuscript from the prehistoric troff format to the 
widely used LaTeX format and to undertake a serious revision of the
book in the process. As the revision became more extensive, the 
title changed to match the course name and content. We reprint 
the original preface to provide some of the original motivation for 
the book, and then close this preface with a description of the goals 
sought during the many subsequent revisions. 



Preface to Random Processes: An Introduction for Engineers

Nothing in nature is random ... A thing appears random only 
through the incompleteness of our knowledge. — Spinoza, Ethics 
I 

I do not believe that God rolls dice. — attributed to Einstein 

Laplace argued to the effect that given complete knowledge of the 
physics of an experiment, the outcome must always be predictable. 
This metaphysical argument must be tempered with several facts. 
The relevant parameters may not be measurable with sufficient pre- 
cision due to mechanical or theoretical limits. For example, the un- 
certainty principle prevents the simultaneous accurate knowledge of 
both position and momentum. The deterministic functions may be 
too complex to compute in finite time. The computer itself may
make errors due to power failures, lightning, or the general perfidy 
of inanimate objects. The experiment could take place in a remote 
location with the parameters unknown to the observer; for example, 
in a communication link, the transmitted message is unknown a pri- 
ori, for if it were not, there would be no need for communication. The 
results of the experiment could be reported by an unreliable witness,
either incompetent or dishonest. For these and other reasons, it
is useful to have a theory for the analysis and synthesis of processes 
that behave in a random or unpredictable manner. The goal is to 
construct mathematical models that lead to reasonably accurate pre- 
diction of the long-term average behavior of random processes. The 
theory should produce good estimates of the average behavior of real 
processes and thereby correct theoretical derivations with measur- 
able results. 

In this book we attempt a development of the basic theory and 
applications of random processes that uses the language and view- 
point of rigorous mathematical treatments of the subject but which 
requires only a typical bachelor’s degree level of electrical engineering 
education including elementary discrete and continuous time linear 
systems theory, elementary probability, and transform theory and ap- 
plications. Detailed proofs are presented only when within the scope 
of this background. These simple proofs, however, often provide 
the groundwork for “handwaving” justifications of more general and 
complicated results that are semi-rigorous in that they can be made 
rigorous by the appropriate delta-epsilontics of real analysis or mea- 
sure theory. A primary goal of this approach is thus to use intuitive 
arguments that accurately reflect the underlying mathematics and 
which will hold up under scrutiny if the student continues to more 
advanced courses. Another goal is to enable the student who might 
not continue to more advanced courses to be able to read and gener- 
ally follow the modern literature on applications of random processes 
to information and communication theory, estimation and detection, 
control, signal processing, and stochastic systems theory. 

Through the years the original book has continually expanded to
roughly double its original size to include more topics, examples, 
and problems. The material has been significantly reorganized in 
its grouping and presentation. Prerequisites and preliminaries have 
been moved to the appendices. Major additional material has been 
added on jointly Gaussian vectors, minimum mean squared error es- 
timation, detection and classification, filtering, and, most recently, 
mean square calculus and its applications to the analysis of contin- 
uous time processes. The index has been steadily expanded to ease 
navigation through the book. Numerous errors reported by reader 
email have been fixed and suggestions for clarifications and improve- 
ments incorporated. 

This book is a work in progress. Revised versions will be made 
available through the World Wide Web page 

http://ee.stanford.edu/~gray/sp.html
The material is copyrighted by the University of Cambridge Press, 
but is freely available to any who wish to use it provided 
only that the contents of the entire text remain intact and to- 
gether. Comments, corrections, and suggestions should be sent to 
rmgray@stanford.edu. Every effort will be made to fix typos and
take suggestions into account on at least an annual basis.

Glossary



{ } a collection of points satisfying some property, e.g., {r : r ≤ a}
is the collection of all real numbers less than or equal to a value a.

[ ] an interval of real points including the end points, e.g., for
a ≤ b, [a, b] = {r : a ≤ r ≤ b}. Called a closed interval.

( ) an interval of real points excluding the end points, e.g., for
a ≤ b, (a, b) = {r : a < r < b}. Called an open interval. Note this is
empty if a = b.

( ], [ ) denote intervals of real points including one endpoint
and excluding the other, e.g., for a ≤ b, (a, b] = {r : a < r ≤ b},
[a, b) = {r : a ≤ r < b}.

∅ the empty set, the set that contains no points.

∀ for all.

Ω the sample space or universal set, the set that contains all of
the points.

#(F) the number of elements in a set F

exp the exponential function, exp(x) = e^x, used for clarity when x
is complicated.

ℱ sigma-field or event space

ℬ(Ω) Borel field of Ω, that is, the sigma-field of subsets of the
real line generated by the intervals or the Cartesian product of a
collection of such sigma-fields.

iff if and only if 

l.i.m. limit in the mean 

o(u) function of u that goes to zero as u → 0 faster than u.

P probability measure 

P_X distribution of a random variable or vector X
p_X probability mass function (pmf) of a random variable X
f_X probability density function (pdf) of a random variable X
F_X cumulative distribution function (cdf) of a random variable X
E(X) expectation of a random variable X
M_X(ju) characteristic function of a random variable X
⊕ addition modulo 2

1_F(x) indicator function of a set F: 1_F(x) = 1 if x ∈ F and 0
otherwise

Φ Phi function (Eq. (2.78))

Q Complementary Phi function (Eq. (2.79)) 

Z_k ≜ {0, 1, 2, ..., k−1}

Z_+ ≜ {0, 1, 2, ...}, the collection of nonnegative integers

Z ≜ {..., −2, −1, 0, 1, 2, ...}, the collection of all integers

Introduction



A random or stochastic process is a mathematical model for a phe- 
nomenon that evolves in time in an unpredictable manner from the 
viewpoint of the observer. The phenomenon may be a sequence of 
real-valued measurements of voltage or temperature, a binary data 
stream from a computer, a modulated binary data stream from a 
modem, a sequence of coin tosses, the daily Dow-Jones average, ra- 
diometer data or photographs from deep space probes, a sequence 
of images from a cable television, or any of an infinite number of 
possible sequences, waveforms, or signals of any imaginable type. It 
may be unpredictable due to such effects as interference or noise in a 
communication link or storage medium, or it may be an information-
bearing signal, deterministic from the viewpoint of an observer at the
transmitter but random to an observer at the receiver. 

The theory of random processes quantifies the above notions so 
that one can construct mathematical models of real phenomena that 
are both tractable and meaningful in the sense of yielding useful 
predictions of future behavior. Tractability is required in order for 
the engineer (or anyone else) to be able to perform analyses and 
syntheses of random processes, perhaps with the aid of computers. 
The “meaningful” requirement is that the models must provide a 
reasonably good approximation of the actual phenomena. An over- 
simplified model may provide results and conclusions that do not 
apply to the real phenomenon being modeled. An overcomplicated 
one may constrain potential applications, render theory too difficult 
to be useful, and strain available computational resources. Perhaps
the most distinguishing characteristic between an average engineer 
and an outstanding engineer is the ability to derive effective models 
providing a good balance between complexity and accuracy. 

Random processes usually occur in applications in the context of 
environments or systems which change the processes to produce other 
processes. The intentional operation on a signal produced by one pro- 
cess, an “input signal,” to produce a new signal, an “output signal,” 
is generally referred to as signal processing , a topic easily illustrated 
by examples. 

• A time varying voltage waveform is produced by a human speaking into 
a microphone or telephone. This signal can be modeled by a random pro- 
cess. This signal might be modulated for transmission, then it might be 
digitized and coded for transmission on a digital link. Noise in the digital 
link can cause errors in reconstructed bits; the bits can then be used to
reconstruct the original signal within some fidelity. All of these operations 
on signals can be considered as signal processing, although the name is 
most commonly used for manmade operations such as modulation, digiti-
zation, and coding, rather than the natural, possibly unavoidable changes
such as the addition of thermal noise or other changes out of our control. 

• For very low bit rate digital speech communication applications, speech 
is sometimes converted into a model consisting of a simple linear filter 
(called an autoregressive filter) and an input process. The idea is that 
the parameters describing the model can be communicated with fewer bits 
than can the original signal, but the receiver can synthesize the human 
voice at the other end using the model so that it sounds very much like 
the original signal. 

• Signals including image data transmitted from remote spacecraft are vir-
tually buried in noise added to them en route and in the front end ampli-
fiers of the receivers used to retrieve the signals. By suitably preparing
the signals prior to transmission, by suitable filtering of the received sig- 
nal plus noise, and by suitable decision or estimation rules, high quality 
images are transmitted through this very poor channel. 

• Signals produced by biomedical measuring devices can display specific 
behavior when a patient suddenly changes for the worse. Signal processing 
systems can look for these changes and warn medical personnel when 
suspicious behavior occurs. 

• Images produced by laser cameras inside elderly North Atlantic pipelines 
can be automatically analyzed to locate possible anomalies indicating
corrosion by looking for locally distinct random behavior. 
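
The autoregressive speech model in the second example above can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the coder used in any real system: the function name synthesize_ar, the coefficient values, the seed, and the Gaussian input process are all hypothetical choices made for the example.

```python
import random

def synthesize_ar(coeffs, excitation):
    """Run an all-pole (autoregressive) filter:
    y[n] = sum_k coeffs[k] * y[n-1-k] + excitation[n]."""
    output = []
    for n, x in enumerate(excitation):
        # Weighted sum of the previously produced output samples.
        past = sum(a * output[n - 1 - k]
                   for k, a in enumerate(coeffs) if n - 1 - k >= 0)
        output.append(past + x)
    return output

# The receiver needs only the few filter coefficients plus a description
# of the input process, rather than every sample of the waveform.
rng = random.Random(0)                                # hypothetical seed
excitation = [rng.gauss(0.0, 1.0) for _ in range(200)]
speech_like = synthesize_ar([1.3, -0.6], excitation)  # hypothetical coefficients
```

The point of the sketch is the bit-rate argument: two coefficients and a statement about the input stand in for two hundred samples.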

How are these signals characterized? If the signals are random, 
how does one find stable behavior or structures to describe the pro- 
cesses? How do operations on these signals change them? How can 
one use observations based on random signals to make intelligent 
decisions regarding future behavior? All of these questions lead to 
aspects of the theory and application of random processes. 

Courses and texts on random processes usually fall into either of 
two general and distinct categories. One category is the common 
engineering approach, which involves fairly elementary probability 
theory, standard undergraduate Riemann calculus, and a large dose 
of “cookbook” formulas — often with insufficient attention paid to 
conditions under which the formulas are valid. The results are of- 
ten justified by nonrigorous and occasionally mathematically inac- 
curate handwaving or intuitive plausibility arguments that may not 
reflect the actual underlying mathematical structure and may not 
be supportable by a precise proof. While intuitive arguments can 
be extremely valuable in providing insight into deep theoretical re- 
sults, they can be a handicap if they do not capture the essence of a 
rigorous proof. 

A development of random processes that is insufficiently mathe- 
matical leaves the student ill prepared to generalize the techniques 
and results when faced with a real-world example not covered in the 
text. For example, if one is faced with the problem of designing 
signal processing equipment for predicting or communicating mea- 
surements being made for the first time by a space probe, how does 
one construct a mathematical model for the physical process that 
will be useful for analysis? If one encounters a process that is nei- 
ther stationary nor ergodic, what techniques still apply? Can the 
law of large numbers still be used to construct a useful model? 

An additional problem with an insufficiently mathematical devel- 
opment is that it does not leave the student adequately prepared to 
read modern literature such as the many Transactions of the IEEE. 
The more advanced mathematical language of recent work is increas- 
ingly used even in simple cases because it is precise and universal and 
focuses on the structure common to all random processes. Even if an
engineer is not directly involved in research, knowledge of the current 
literature can often provide useful ideas and techniques for tackling 
specific problems. Engineers unfamiliar with basic concepts such as 
sigma-field and conditional expectation will find many potentially
valuable references shrouded in mystery. 

The other category of courses and texts on random processes is the 
typical mathematical approach, which requires an advanced mathe- 
matical background of real analysis, measure theory, and integration 
theory. This approach involves precise and careful theorem state- 
ments and proofs, and uses far more care to specify precisely the 
conditions required for a result to hold. Most engineers do not, how- 
ever, have the required mathematical background, and the extra care 
required in a completely rigorous development severely limits the 
number of topics that can be covered in a typical course — in par- 
ticular, the applications that are so important to engineers tend to 
be neglected. In addition, too much time is spent with the formal 
details, obscuring the often simple and elegant ideas behind a proof. 
Often little, if any, physical motivation for the topics is given. 

This book attempts a compromise between the two approaches 
by giving the basic, elementary theory and a profusion of examples 
in the language and notation of the more advanced mathematical 
approaches. The intent is to make the crucial concepts clear in the 
traditional elementary cases, such as coin flipping, and thereby to 
emphasize the mathematical structure of all random processes in the 
simplest possible context. The structure is then further developed by 
numerous increasingly complex examples of random processes that 
have proved useful in systems analysis. The complicated examples 
are constructed from the simple examples by signal processing, that 
is, by using a simple process as an input to a system whose output 
is the more complicated process. This has the double advantage of 
describing the action of the system, the actual signal processing, and 
the interesting random process which is thereby produced. As one 
might suspect, signal processing also can be used to produce simple 
processes from complicated ones. 

Careful proofs are usually constructed only in elementary cases. 
For example, the fundamental theorem of expectation is proved only
for discrete random variables, where it is proved simply by a change 
of variables in a sum. The continuous analog is subsequently given 
without a careful proof, but with the explanation that it is simply the 
integral analog of the summation formula and hence can be viewed 
as a limiting form of the discrete result. As another example, only 
weak laws of large numbers are proved in detail in the mainstream 
of the text, but the strong law is treated in detail for a special case 
in a starred section. Starred sections are used to delve into other 
relatively advanced results, for example the use of mean square con- 
vergence ideas to make rigorous the notion of integration and filtering 
of continuous time processes. 
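
For a discrete random variable X with pmf p_X, the change of variables in a sum mentioned above can be written out as follows. The function g and the derived random variable Y = g(X) are notation assumed here for illustration; the continuous line is stated without proof, as in the text.

```latex
\begin{align*}
E(Y) &= \sum_y y\, p_Y(y)
      = \sum_y y \sum_{x:\, g(x) = y} p_X(x) \\
     &= \sum_y \sum_{x:\, g(x) = y} g(x)\, p_X(x)
      = \sum_x g(x)\, p_X(x), \\
E(g(X)) &= \int g(x)\, f_X(x)\, dx
      \quad \text{(continuous analog, when the integral is well defined).}
\end{align*}
```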

By these means we strive to capture the spirit of important proofs 
without undue tedium and to make plausible the required assump- 
tions and constraints. This, in turn, should aid the student in deter- 
mining when certain tools do or do not apply and what additional 
tools might be necessary when new generalizations are required. 

A distinct aspect of the mathematical viewpoint is the “grand ex- 
periment” view of random processes as being a probability measure 
on sequences (for discrete time) or waveforms (for continuous time) 
rather than being an infinity of smaller experiments representing in- 
dividual outcomes (called random variables) that are somehow glued 
together. From this point of view random variables are merely special 
cases of random processes. In fact, the grand experiment viewpoint 
was popular in the early days of applications of random processes 
to systems and was called the “ensemble” viewpoint in the work of 
Norbert Wiener and his students. By viewing the random process as 
a whole instead of as a collection of pieces, many basic ideas, such 
as stationarity and ergodicity, that characterize the dependence on 
time of probabilistic descriptions and the relation between time aver- 
ages and probabilistic averages are much easier to define and study. 
This also permits a more complete discussion of processes that vi- 
olate such probabilistic regularity requirements yet still have useful 
relations between time and probabilistic averages. 

Even though a student completing this book will not be able to 
follow the details in the literature of many proofs of results involving 
random processes, the basic results and their development and im-
plications should be accessible, and the most common examples of
random processes and classes of random processes should be familiar. 
In particular, the student should be well equipped to follow the gist 
of most arguments in the various Transactions of the IEEE dealing 
with random processes, including the IEEE Transactions on Signal
Processing, IEEE Transactions on Image Processing, IEEE Transac-
tions on Speech and Audio Processing, IEEE Transactions on Com-
munications, IEEE Transactions on Control, and IEEE Transactions
on Information Theory.

It also should be mentioned that the authors are electrical engi- 
neers and, as such, have written this text with an electrical engineer- 
ing flavor. However, the required knowledge of classical electrical 
engineering is slight, and engineers in other fields should be able to 
follow the material presented. 

This book is intended to provide a one-quarter or one-semester 
course that develops the basic ideas and language of the theory of 
random processes and provides a rich collection of examples of com- 
monly encountered processes, properties, and calculations. Although 
in some cases these examples may seem somewhat artificial, they are 
chosen to illustrate the way engineers should think about random 
processes. They are selected for simplicity and conceptual content 
rather than to present the method of solution to some particular ap- 
plication. Sections that can be skimmed or omitted for the shorter 
one-quarter curriculum are marked with a star (*). Discrete time
processes are given more emphasis than in many texts because they 
are simpler to handle and because they are of increasing practical im- 
portance in digital systems. For example, linear filter input/output 
relations are carefully developed for discrete time; then the contin- 
uous time analogs are obtained by replacing sums with integrals. 
The mathematical details underlying the continuous time results are 
found in a starred section. 
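
As a sketch of what replacing sums with integrals means here, consider a linear time-invariant filter written as a convolution. The symbols x (input), h (pulse or impulse response), and y (output) are assumed for illustration and may differ from the notation used in the chapters.

```latex
% Discrete time: input x_n, pulse response h_k, output y_n
\[ y_n = \sum_{k} h_k \, x_{n-k} \]
% Continuous time: input x(t), impulse response h(t), output y(t)
\[ y(t) = \int h(\tau)\, x(t - \tau)\, d\tau \]
```

When x is a random process, giving the integral a precise meaning takes extra care, which is why those details are deferred to a starred section.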

Most examples are developed by beginning with simple processes. 
These processes are filtered or modulated to obtain more complicated 
processes. This provides many examples of typical probabilistic com- 
putations on simple processes and on the output of operations on 
simple processes. Extra tools are introduced as needed to develop
properties of the examples. 

The prerequisites for this book are elementary set theory, elemen- 
tary probability, and some familiarity with linear systems theory 
(Fourier analysis, convolution, discrete and continuous time linear 
filters, and transfer functions). The elementary set theory and prob-
ability may be found, for example, in the classic text by Al Drake [18]
or in the current MIT basic probability text by Bertsekas and Tsit-
siklis [3]. The Fourier and linear systems material can be found in
numerous texts, including Gray and Goodman [30]. Some of these 
basic topics are reviewed in this book in appendix A. These results 
are considered prerequisite as the pace and density of material would 
likely be overwhelming to someone not already familiar with the fun- 
damental ideas of probability such as probability mass and density 
functions (including the more common named distributions), com- 
puting probabilities, derived distributions, random variables, and ex- 
pectation. It has long been the authors’ experience that the students 
having the most difficulty with this material are those with little or 
no experience with elementary probability. 



Organization of the Book 

Chapter 2 provides a careful development of the fundamental con- 
cept of probability theory — a probability space or experiment. The 
notions of sample space, event space, and probability measure are 
introduced and illustrated by examples. Independence and elemen- 
tary conditional probability are developed in some detail. The ideas 
of signal processing and of random variables are introduced briefly 
as functions or operations on the output of an experiment. This in 
turn allows mention of the idea of expectation at an early stage as a 
generalization of the description of probabilities by sums or integrals. 

Chapter 3 treats the theory of measurements made on experiments: 
random variables, which are scalar-valued measurements; random
vectors, which are a vector or finite collection of measurements; and 
random processes, which can be viewed as sequences or waveforms 
of measurements. Random variables, vectors, and processes can all 
be viewed as forms of signal processing: each operates on “inputs,”
which are the sample points of a probability space, and produces an 
“output,” which is the resulting sample value of the random variable, 
vector, or process. These output points together constitute an output 
sample space, which inherits its own probability measure from the 
structure of the measurement and the underlying experiment. As 
a result, many of the basic properties of random variables, vectors, 
and processes follow from those of probability spaces. Probability 
distributions are introduced along with probability mass functions, 
probability density functions, and cumulative distribution functions. 
The basic derived distribution method is described and demonstrated 
by example. A wide variety of examples of random variables, vectors, 
and processes are treated. Expectations are introduced briefly as a 
means of characterizing distributions and to provide some calculus 
practice. 

Chapter 4 develops in depth the ideas of expectation — averages 
of random objects with respect to probability distributions. Also 
called probabilistic averages, statistical averages, and ensemble av- 
erages, expectations can be thought of as providing simple but im- 
portant parameters describing probability distributions. A variety 
of specific averages are considered, including mean, variance, char- 
acteristic functions, correlation, and covariance. Several examples of 
unconditional and conditional expectations and their properties and 
applications are provided. Perhaps the most important application is 
to the statement and proof of laws of large numbers or ergodic the- 
orems, which relate long term sample average behavior of random 
processes to expectations. In this chapter laws of large numbers are 
proved for simple, but important, classes of random processes. Other 
important applications of expectation arise in performing and ana- 
lyzing signal processing applications such as detecting, classifying, 
and estimating data. Minimum mean squared nonlinear and linear 
estimation of scalars and vectors is treated in some detail, showing 
the fundamental connections among conditional expectation, opti- 
mal estimation, and second order moments of random variables and 
vectors. 
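
The relation between long term sample averages and expectations can be seen numerically with coin flips. The following is a minimal simulation sketch, not a proof; the function name sample_average, the seed, and the sample sizes are assumptions chosen for illustration.

```python
import random

def sample_average(n, p=0.5, seed=0):
    """Average of n independent coin flips, each equal to 1 with probability p."""
    rng = random.Random(seed)
    flips = [1 if rng.random() < p else 0 for _ in range(n)]
    return sum(flips) / n

# As n grows, the sample average tends to settle near the expectation p.
for n in (10, 1000, 100000):
    print(n, sample_average(n))
```

The law of large numbers makes precise the sense in which the printed averages approach p = 0.5 as n increases.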

Chapter 5 concentrates on the computation and applications of 
second-order moments (the mean and covariance) of a variety
of random processes. The primary example is a form of derived 
distribution problem: if a given random process with known second-
order moments is put into a linear system, what are the second-order
moments of the resulting output random process? This problem is 
treated for linear systems represented by convolutions and for linear 
modulation systems. Transform techniques are shown to provide a 
simplification in the computations, much like their ordinary role in 
elementary linear systems theory. Mean square convergence is revis- 
ited and several of its applications to the analysis of continuous time 
random processes are collected under the heading of mean square 
calculus. Included are a careful definition of integration and filtering 
of random processes, differentiation of random processes, and sam-
pling and orthogonal expansions of random processes. In all of these
examples the behavior of the second order moments determines the 
applicability of the results. The chapter closes with a development 
of several results from the theory of linear least-squares estimation. 
This provides an example of both the computation and the applica- 
tion of second-order moments. 
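
A small special case shows the flavor of these calculations. Assume, purely for illustration, a filter with a finite pulse response h and an input X_n whose samples are uncorrelated with a common variance. The output Y_n = sum_k h[k] X_{n-k} then has covariance K_Y(lag) equal to the input variance times sum_j h[j] h[j + |lag|]. The function name and arguments below are hypothetical.

```python
def output_covariance(h, input_variance, lag):
    """Covariance K_Y(lag) of Y_n = sum_k h[k] * X_{n-k} when the X_n are
    uncorrelated with a common variance input_variance."""
    lag = abs(lag)  # the covariance depends only on the size of the lag
    # Only overlapping taps contribute; the sum is empty when lag >= len(h).
    return input_variance * sum(h[j] * h[j + lag] for j in range(len(h) - lag))

# Two-tap summing filter: neighboring outputs share one input sample.
print([output_covariance([1, 1], 1, k) for k in range(3)])
```

Only the variance of the input was needed, not its full distribution, which is what makes second-order analysis of linear systems tractable.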

In Chapter 6 a variety of useful models of sometimes complicated 
random processes are developed. A powerful approach to modeling 
complicated random processes is to consider linear systems driven by 
simple random processes. Chapter 5 used this approach to compute
second order moments; this chapter goes beyond moments to develop
a complete description of the output processes. To accomplish this, 
however, one must make additional assumptions on the input pro- 
cess and on the form of the linear filters. The general model of a 
linear filter driven by a memoryless process is used to develop sev- 
eral popular models of discrete time random processes. Analogous 
continuous time random process models are then developed by direct 
description of their behavior. The principal class of random processes
considered is the class of independent increment processes, but other 
processes with similar definitions but quite different properties are 
also introduced. Among the models considered are autoregressive 
processes, moving-average processes, ARMA (autoregressive moving
average) processes, random walks, independent increment processes,
Markov processes, Poisson and Gaussian processes, and the random
telegraph wave process. We also briefly consider an example of a 
nonlinear system where the output random processes can at least 
be partially described — the exponential function of a Gaussian or 
Poisson process which models phase or frequency modulation. We 
close with examples of a type of “doubly stochastic” process — a 
compound process formed by adding a random number of other
random effects. 
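
One of the simplest instances of a linear system driven by a memoryless process is a running sum of independent steps, which yields a random walk with independent increments. The sketch below assumes equally likely steps of +1 and -1; the function name random_walk and the seed are hypothetical.

```python
import random

def random_walk(steps):
    """Accumulate the steps: Y_n = Y_{n-1} + X_n, starting from Y_0 = 0
    (Y_0 itself is not included in the returned list)."""
    position, path = 0, []
    for x in steps:
        position += x
        path.append(position)
    return path

rng = random.Random(1)                            # hypothetical seed
steps = [rng.choice((-1, 1)) for _ in range(20)]  # iid +1/-1 input process
path = random_walk(steps)
```

The input samples are independent, while successive outputs are strongly dependent. Their differences recover the independent inputs, and that is the independent increment property.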

Appendix A sketches several prerequisite definitions and concepts 
from elementary set theory and linear systems theory using examples 
to be encountered later in the book. The first subject is crucial at 
an early stage and should be reviewed before proceeding to chapter 
2. The second subject is not required until chapter 5, but it serves 
as a reminder of material with which the student should already be 
familiar. Elementary probability is not reviewed, as our basic devel- 
opment includes elementary probability. The review of prerequisite 
material in the appendix serves to collect together some notation 
and many definitions that will be used throughout the book. It is, 
however, only a brief review and cannot serve as a substitute for a 
complete course on the material. This chapter can be given as a first 
reading assignment and either skipped or skimmed briefly in class; 
lectures can proceed from an introduction, perhaps incorporating 
some preliminary material, directly to chapter 2. 

Appendix B provides some scattered definitions and results needed 
in the book that detract from the main development, but may be of 
interest for background or detail. These fall primarily in the realm of 
calculus and range from the evaluation of common sums and integrals 
to a consideration of different definitions of integration. Many of the 
sums and integrals should be prerequisite material, but it has been 
the authors’ experience that many students have either forgotten or 
not seen many of the standard tricks. Hence several of the most im- 
portant techniques for probability and signal processing applications 
are included. Also in this appendix some background information on 
limits of double sums and the Lebesgue integral is provided. 

Appendix C collects the common univariate probability mass functions
and probability density functions along with their second order
moments for reference.

The book concludes with an appendix suggesting supplementary 
reading, providing occasional historical notes, and delving deeper 
into some of the technical issues raised in the book. In that section 
we assemble references on additional background material as well 
as on books that pursue the various topics in more depth or on a 
more advanced level. We feel that these comments and references 
are supplementary to the development and that less clutter results 
by putting them in a single appendix rather than strewing them 
throughout the text. The section is intended as a guide for further 
study, not as an exhaustive description of the relevant literature, the 
latter goal being beyond the authors’ interests and stamina. 

Each chapter is accompanied by a collection of problems, many
of which have been contributed by colleagues, readers, students, and
former students. It is important when doing the problems to justify
any “yes/no” answers. If an answer is “yes,” prove it is so. If the
answer is “no,” provide a counterexample.

2.1 Introduction

The theory of random processes is a branch of probability theory 
and probability theory is a special case of the branch of mathematics 
known as measure theory. Probability theory and measure theory 
both concentrate on functions that assign real numbers to certain sets 
in an abstract space according to certain rules. These set functions 
can be viewed as measures of the size or weight of the sets. For 
example, the precise notion of area in two-dimensional Euclidean 
space and volume in three-dimensional space are both examples of 
measures on sets. Other measures on sets in three dimensions are 
mass and weight. Observe that from elementary calculus we can find 
volume by integrating a constant over the set. From physics we can 
find mass by integrating a mass density or summing point masses over 
a set. In both cases the set is a region of three-dimensional space. 
In a similar manner, probabilities will be computed by integrals of 
densities of probability or sums of “point masses” of probability. 

Both probability theory and measure theory consider only nonnegative
real-valued set functions. The value assigned by the function to
a set is called the probability or the measure of the set, respectively.
The basic difference between probability theory and measure theory 
is that the former considers only set functions that are normalized 
in the sense of assigning the value of 1 to the entire abstract space, 
corresponding to the intuition that the abstract space contains every 
possible outcome of an experiment and hence should happen with
certainty or probability 1. Subsets of the space have some uncer- 
tainty and hence have probability less than 1. 

Probability theory begins with the concept of a probability space,
which is a collection of three items:

1. An abstract space $\Omega$, as encountered in appendix A, called a sample space,
which contains all distinguishable elementary outcomes or results of an 
experiment. These points might be names, numbers, or complicated 
signals. 

2. An event space or sigma-field $\mathcal{F}$ consisting of a collection of subsets of the
abstract space which we wish to consider as possible events and to which 
we wish to assign a probability. We require that the event space have 
an algebraic structure in the following sense: any finite or countably 
infinite sequence of set-theoretic operations (union, intersection, com- 
plementation, difference, symmetric difference) on events must produce 
other events. 

3. A probability measure P — an assignment of a number between 0 and 
1 to every event, that is, to every set in the event space. A probability 
measure must obey certain rules or axioms and will be computed by 
integrating or summing, analogously to area, volume, and mass compu- 
tations. 
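
For a finite sample space the three items can be written down explicitly. The following Python sketch is only an illustration under stated assumptions: the four-point sample space, the two candidate collections of subsets, the equal point masses of 1/4, and the helper names `is_event_space` and `prob` are all hypothetical choices made here, not constructions taken from the text.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical four-point sample space (illustration only).
omega = frozenset({0, 1, 2, 3})

def is_event_space(events, omega):
    """Check closure under complement and pairwise union on a finite space."""
    events = set(events)
    if omega not in events:
        return False
    for f in events:
        if omega - f not in events:      # complement must be an event
            return False
    for f, g in combinations(events, 2):
        if f | g not in events:          # union must be an event
            return False
    return True

# A coarse event space: we can only tell {0, 1} apart from {2, 3}.
coarse = {frozenset(), frozenset({0, 1}), frozenset({2, 3}), omega}
# Not an event space: the complement of {0} is missing.
broken = {frozenset(), frozenset({0}), omega}

def prob(event):
    """Probability measure: sum of equal point masses of 1/4 (assumed)."""
    return sum((Fraction(1, 4) for _ in event), Fraction(0))

print(is_event_space(coarse, omega), is_event_space(broken, omega))
print(prob(frozenset({0, 1})))
```

On a finite space, closure under complements and pairwise unions is enough to give closure under the other set-theoretic operations, which is why the check above tests only those two.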

This chapter is devoted to developing the ideas underlying the 
triple $(\Omega, \mathcal{F}, P)$, which is collectively called a probability space or
an experiment. Before making these ideas precise, however, several 
comments are in order. 

First of all, it should be emphasized that a probability space is 
composed of three parts; an abstract space is only one part. Do 
not let the terminology confuse you: “space” has more than one 
usage. Having an abstract space model all possible distinguishable 
outcomes of an experiment should be an intuitive idea since it simply 
gives a precise mathematical name to an imprecise English descrip- 
tion. Since subsets of the abstract space correspond to collections 
of elementary outcomes, it should also be possible to assign prob- 
abilities to such sets. It is a little harder to see, but we can also 
argue that we should focus on the sets and not on the individual 
points when assigning probabilities since in many cases a probability 
assignment known only for points will not be very useful. For example,
if we spin a pointer and the outcome is known to be equally
likely to be any number between 0 and 1, then the probability that
any particular point such as 0.3781984637 or exactly $1/\pi$ occurs is 0
because there is an uncountable infinity of possible points, none more 
likely than the others. Hence knowing only that the probability of 
each and every point is zero, we would be hard pressed to make any 
meaningful inferences about the probabilities of other events such as 
the outcome being between 1/2 and 3/4. Writers of fiction (includ- 
ing Patrick O’Brian in his Aubrey-Maturin series) have often made 
much of the fact that extremely unlikely events often occur. One 
can say that zero probability events occur virtually all the time since 
the a priori probability that the universe will be exactly in a par- 
ticular configuration at 12:01AM Coordinated Universal Time (aka 
Greenwich Mean Time) is 0, yet the universe will indeed be in some 
configuration at that time. 

The difficulty inherent in this example leads to a less natural as- 
pect of the probability space triumvirate — the fact that we must 
specify an event space or collection of subsets of our abstract space 
to which we wish to assign probabilities. In the example it is clear 
that taking the individual points and their countable combinations is 
not enough (see also problem 2.3). On the other hand, why not just 
make the event space the class of all subsets of the abstract space? 
Why require the specification of which subsets are to be deemed suf- 
ficiently important to be blessed with the name “event”? In fact, 
this concern is one of the principal differences between elementary 
probability theory and advanced probability theory (and the point 
at which the student’s intuition frequently runs into trouble). When 
the abstract space is finite or even countably infinite, one can con- 
sider all possible subsets of the space to be events, and one can build 
a useful theory. When the abstract space is uncountably infinite,
however, as in the case of the space consisting of the real line or the
unit interval, one cannot build a useful theory without constraining
the subsets to which one will assign a probability. Roughly speaking,
this is because probabilities of sets in uncountable spaces are
found by integrating over sets, and some sets are simply too nasty to
be integrated over. Although it is difficult to show, for such spaces
there does not exist a reasonable and consistent means of assigning
probabilities to all subsets without contradiction or without violating
desirable properties. In fact, it is so difficult to show that such
“non-probability-measurable” subsets of the real line exist that we
will not attempt to do so in this book. The reader should at least be
aware of the problem so that the need for specifying an event space
is understood. It also explains why the reader is likely to encounter
phrases like “measurable sets” and “measurable functions” in the
literature: some things are unmeasurable!

Footnote 1. A set is countably infinite if it can be put into one-to-one
correspondence with the nonnegative integers and hence can be counted.
For example, the set of positive integers is countable and the set of
all rational numbers is countable. The set of all irrational numbers and
the set of all real numbers are both uncountable. See appendix A for a
discussion of countably infinite vs. uncountably infinite spaces.

Thus a probability space must make explicit not just the elemen- 
tary outcomes or “finest-grain” outcomes that constitute our abstract 
space; it must also specify the collections of sets of these points to 
which we intend to assign probabilities. Subsets of the abstract space 
that do not belong to the event space will simply not have probabili- 
ties defined. The algebraic structure that we have postulated for the 
event space will ensure that if we take (countable) unions of events 
(corresponding to a logical “or”) or intersections of events (corre- 
sponding to a logical “and”), then the resulting sets are also events 
and hence will have probabilities. In fact, this is one of the main 
functions of probability theory: given a probabilistic description of 
a collection of events, find the probability of some new event formed 
by set-theoretic operations on the given events. 

Up to this point the notion of signal processing has not been mentioned.
It enters at a fundamental level if one realizes that each individual
point $\omega \in \Omega$ produced in an experiment can be viewed as a
signal: it might be a single voltage conveying the value of a measurement,
a vector of values, a sequence of values, or a waveform, any one
of which can be interpreted as a signal measured in the environment
or received from a remote transmitter or extracted from a physical
medium that was previously recorded. Signal processing in general
is the performing of some operation on the signal. In its simplest yet
most general form this consists of applying some function or mapping
or operation $g$ to the signal or input $\omega$ to produce an output $g(\omega)$,
which might be intended to guess some hidden parameter, extract 
useful information from noise, enhance an image, or any simple or 
complicated operation intended to produce a useful outcome. If we 
have a probabilistic description of the underlying experiment, then 
we should be able to derive a probabilistic description of the outcome 
of the signal processor. This, in fact, is the core problem of derived 
distributions, one of the fundamental tools of both probability the- 
ory and signal processing. In fact, this idea of defining functions 
on probability spaces is the foundation for the definition of random 
variables, random vectors, and random processes, which will inherit 
their basic properties from the underlying probability space, thereby 
yielding new probability spaces. Much of the theory of random pro- 
cesses and signal processing consists of developing the implications 
of certain operations on probability spaces: beginning with some 
probability space we form new ones by operations called variously 
mappings, filtering, sampling, coding, communicating, estimating, 
detecting, averaging, measuring, enhancing, predicting, smoothing, 
interpolating, classifying, analyzing or other names denoting linear or 
nonlinear operations. Stochastic systems theory is the combination 
of systems theory with probability theory. The essence of stochastic 
systems theory is the connection of a system to a probability space. 
Thus a precise formulation and a good understanding of probability 
spaces are prerequisites to a precise formulation and correct devel- 
opment of examples of random processes and stochastic systems. 

Before proceeding to a careful development, several of the basic 
ideas are illustrated informally with simple examples. 



2.2 Spinning Pointers and Flipping Coins 

Many of the basic ideas at the core of this text can be introduced and 
illustrated by two very simple examples, the continuous experiment 
of spinning a pointer inside a circle and the discrete experiment of 
flipping a coin.

A Uniform Spinning Pointer 

[Figure 2.1. The Spinning Pointer]

Suppose that Nature (or perhaps Tyche, the Greek Goddess of
chance) spins a pointer in a circle as depicted in Figure 2.1. When
the pointer stops it can point to any number in the unit interval
$[0, 1) = \{r : 0 \le r < 1\}$. We call $[0, 1)$ the sample space of our
experiment and denote it by a capital Greek omega, $\Omega$. What can we
say about the probabilities or chances of particular events or out- 
comes occurring as a result of this experiment? The sorts of events 
of interest are things like “the pointer points to a number between 0 
and .5” (which one would expect should have probability 0.5 if the 
wheel is indeed fair) or “the pointer does not lie between 0.75 and 
1” (which should have a probability of 0.75). Two assumptions are 
implicit here. The first is that an “outcome” of the experiment or 
an “event” to which we can assign a probability is simply a subset of 
[0, 1). The second assumption is that the probability of the pointer 
landing in any particular interval of the sample space is proportional 
to the length of the interval. This should seem reasonable if we in- 
deed believe the spinning pointer to be “fair” in the sense of not 
favoring any outcomes over any others. The bigger a region of the 
circle, the more likely the pointer is to end up in that region. We can 
formalize this by stating that for any interval $[a, b] = \{r : a \le r \le b\}$
with $0 \le a \le b < 1$ we have that the probability of the event “the
pointer lands in the interval $[a, b]$” is

$$P([a, b]) = b - a. \tag{2.1}$$

We do not have to restrict interest to intervals in order to define
probabilities consistent with (2.1). The notion of the length of an
interval can be made precise using calculus and simultaneously extended
to any subset of $[0, 1)$ by defining the probability $P(F)$ of a
set $F \subset [0, 1)$ as

$$P(F) = \int_F f(r)\,dr, \tag{2.2}$$

where $f(r) = 1$ for all $r \in [0, 1)$. With this definition it is clear that
for any $0 \le a < b < 1$

$$P([a, b]) = \int_a^b f(r)\,dr = b - a. \tag{2.3}$$
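
The claim that an interval's probability equals its length can be checked empirically. The sketch below is a rough illustration, assuming that Python's `random.Random.random()` is an acceptable stand-in for the fair pointer; the function names, the seed, and the number of trials are arbitrary choices made here.

```python
import random

def spin(rng):
    """One spin of the (simulated) fair pointer: a number in [0, 1)."""
    return rng.random()

def estimate_interval_probability(a, b, trials=200_000, seed=1):
    """Fraction of simulated spins that land in the closed interval [a, b]."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if a <= spin(rng) <= b)
    return hits / trials

est = estimate_interval_probability(0.5, 0.75)
print(est)  # should be close to b - a = 0.25
```

The estimate fluctuates from seed to seed, but for this many trials it should agree with the length of the interval to about two decimal places.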

We could also arrive at effectively the same model by considering
the sample space to be the entire real line, $\Omega = \Re = (-\infty, \infty)$, and
defining the pdf to be

$$f(r) = \begin{cases} 1 & \text{if } r \in [0, 1) \\ 0 & \text{otherwise} \end{cases}. \tag{2.4}$$

The integral can also be expressed without specifying limits of integration
by using the indicator function of a set

$$1_F(r) = \begin{cases} 1 & \text{if } r \in F \\ 0 & \text{if } r \notin F \end{cases} \tag{2.5}$$

as

$$P(F) = \int 1_F(r) f(r)\,dr. \tag{2.6}$$
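
The indicator-function form is convenient numerically because the set $F$ enters only through the integrand. The following sketch approximates the integral by a midpoint Riemann sum. The choice of $F$ as a finite union of closed intervals, the integration range $[-1, 2]$, the number of subintervals, and the function names are assumptions made for illustration.

```python
def indicator(intervals):
    """Return the indicator function 1_F for F a finite union of closed intervals."""
    def one_f(r):
        return 1.0 if any(a <= r <= b for a, b in intervals) else 0.0
    return one_f

def uniform_pdf(r):
    """The pdf of (2.4): 1 on [0, 1) and 0 elsewhere."""
    return 1.0 if 0.0 <= r < 1.0 else 0.0

def probability(one_f, pdf, lo=-1.0, hi=2.0, n=300_000):
    """Midpoint Riemann sum of 1_F(r) f(r) over [lo, hi]."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        r = lo + (i + 0.5) * h
        total += one_f(r) * pdf(r)
    return total * h

F = [(0.1, 0.3), (0.6, 0.9)]
p = probability(indicator(F), uniform_pdf)
print(p)  # approximately 0.2 + 0.3 = 0.5
```

Because the pdf vanishes outside $[0, 1)$, a set such as $[-0.5, 0.25]$ that extends beyond the unit interval is handled with no change to the code.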



Other implicit assumptions have been made here. The first is that 
probabilities must satisfy some consistency properties. We cannot 
arbitrarily define probabilities of distinct subsets of $[0, 1)$ (or, more
generally, $\Re$) without regard to the implications of probabilities for
other sets; the probabilities must be consistent with each other in
the sense that they do not contradict each other. For example, if we
have two formulas for computing probabilities of a common event,
as we have with (2.1) and (2.2) for computing the probability of an
interval, then both formulas must give the same numerical result,
as they do in this example.

The second implicit assumption is that the integral exists in a well 
defined sense, that it can be evaluated using calculus. As surprising 
as it may seem to readers familiar only with typical engineering- 
oriented developments of Riemann integration, the integral of (2.2) 
is in fact not well defined for all subsets of [0, 1). But we leave this 
detail for later and assume for the moment that we only encounter 
sets for which the integral (and hence the probability) is well defined. 

The function $f(r)$ is called a probability density function or pdf
since it is a nonnegative point function that is integrated to compute
total probability of a set, just as a mass density function is integrated
over a region to compute the mass of a region in physics. Since in
this example $f(r)$ is constant over a region, it is called a uniform pdf.

The formula (2.2) for computing probability has many implica- 
tions, three of which merit comment at this point. 

• Probabilities are nonnegative: 



$$P(F) \ge 0 \quad \text{for any } F. \tag{2.7}$$



This follows since integrating a nonnegative argument yields a non- 
negative result. 

• The probability of the entire sample space is 1: 



$$P(\Omega) = 1. \tag{2.8}$$



This follows since integrating 1 over the unit interval yields 1, but it 
has the intuitive interpretation that the probability that “something 
happens” is 1. 

• The probability of the union of disjoint or mutually exclusive re- 
gions is the sum of the probabilities of the individual events: 



$$\text{If } F \cap G = \emptyset, \text{ then } P(F \cup G) = P(F) + P(G). \tag{2.9}$$

This follows immediately from the properties of integration:

$$\begin{aligned}
P(F \cup G) &= \int_{F \cup G} f(r)\,dr \\
&= \int_F f(r)\,dr + \int_G f(r)\,dr \\
&= P(F) + P(G).
\end{aligned}$$



An alternative proof follows by observing that since $F$ and $G$ are
disjoint, $1_{F \cup G}(r) = 1_F(r) + 1_G(r)$ and hence linearity of integration
implies that

$$\begin{aligned}
P(F \cup G) &= \int 1_{F \cup G}(r) f(r)\,dr \\
&= \int \left(1_F(r) + 1_G(r)\right) f(r)\,dr \\
&= \int 1_F(r) f(r)\,dr + \int 1_G(r) f(r)\,dr \\
&= P(F) + P(G).
\end{aligned}$$

This property is often called the additivity property of probability. 
The second proof makes it clear that additivity of probability is an 
immediate result of the linearity of integration, i.e., that the integral
of the sum of two functions is the sum of the two integrals. 

Repeated application of additivity for two events shows that for
any finite collection $\{F_k; k = 1, 2, \ldots, K\}$ of disjoint events, i.e.,
events with the property that $F_k \cap F_j = \emptyset$ for all $k \ne j$, we have
that

$$P\left(\bigcup_{k=1}^{K} F_k\right) = \sum_{k=1}^{K} P(F_k), \tag{2.10}$$

showing that additivity is equivalent to finite additivity, the extension
of the additivity property from two to a finite collection of sets.
Since additivity is a special case of finite additivity and it implies
finite additivity, the two notions are equivalent and we can use them
interchangeably.
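
The “repeated application” can be spelled out as a short induction on $K$. The following is a sketch of the inductive step only, written in the notation used for additivity and finite additivity above. It assumes the result already holds for $K - 1$ disjoint events and uses the fact that $F_K$ is disjoint from the union of the earlier sets:

$$\begin{aligned}
P\left(\bigcup_{k=1}^{K} F_k\right)
&= P\left(\left(\bigcup_{k=1}^{K-1} F_k\right) \cup F_K\right) \\
&= P\left(\bigcup_{k=1}^{K-1} F_k\right) + P(F_K) \\
&= \sum_{k=1}^{K-1} P(F_k) + P(F_K) = \sum_{k=1}^{K} P(F_k).
\end{aligned}$$

The second line is additivity for two disjoint events and the third is the induction hypothesis.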

These three properties of nonnegativity, normalization, and additivity
are fundamental to the definition of the general notion of
probability and will form three of the four axioms needed for a precise
development. It is tempting to call an assignment $P$ of numbers
to subsets of a sample space a probability measure if it satisfies these
three properties, but we shall see that a fourth condition, which
is crucial for having well behaved limits and asymptotics, will be
needed to complete the definition. Pending this fourth condition,
(2.2) defines a probability measure. In fact, this definition is complete
in the simple case where the sample space $\Omega$ has only a finite
number of points since in that case limits and asymptotics become
trivial. A sample space together with a probability measure provide
a mathematical model for an experiment. This model is often called
a probability space, but for the moment we shall stick to the less
intimidating word of experiment.



Simple Properties 

Several simple properties of probabilities can be derived from what 
we have so far. As particularly simple, but still important, examples, 
consider the following. 

Assume that $P$ is a set function defined on a sample space $\Omega$ that
satisfies properties (2.7)-(2.9). Then

(a) $P(F^c) = 1 - P(F)$.

(b) $P(F) \le 1$.

(c) Let $\emptyset$ be the null or empty set, then $P(\emptyset) = 0$.

(d) If $\{F_i; i = 1, 2, \ldots, K\}$ is a finite partition of $\Omega$, i.e., if
$F_i \cap F_k = \emptyset$ when $i \ne k$ and $\bigcup_{i=1}^{K} F_i = \Omega$, then

$$P(G) = \sum_{i=1}^{K} P(G \cap F_i) \tag{2.11}$$

for any event $G$.

Proof (a) $F \cup F^c = \Omega$ implies $P(F \cup F^c) = 1$ (property (2.8)).
$F \cap F^c = \emptyset$ implies $1 = P(F \cup F^c) = P(F) + P(F^c)$ (property
(2.9)).

(b) $P(F) = 1 - P(F^c) \le 1$ (property (2.7) and (a) above).

(c) By property (2.8) and (a) above, $P(\Omega^c) = P(\emptyset) = 1 - P(\Omega) = 0$.

(d) $P(G) = P(G \cap \Omega) = P\left(G \cap \left(\bigcup_i F_i\right)\right) = P\left(\bigcup_i (G \cap F_i)\right) = \sum_i P(G \cap F_i)$. □

Observe that although the null or empty set 0 has probability 0, 
the converse is not true in that a set need not be empty just because
it has zero probability. In the uniform fair wheel example
the set $F = \{1/n : n = 1, 2, 3, \ldots\}$ is not empty, but it does have
probability zero. This follows roughly because for any finite $N$,
$P(\{1/n : n = 2, 3, \ldots, N\}) = 0$ (since the integral of 1 over a finite
set of points is zero) and therefore the limit as $N \to \infty$ must also
be zero, a “continuity of probability” idea that we shall later make 
rigorous. 



A Single Coin Flip 

The original example of a spinning wheel is continuous in that the 
sample space consists of a continuum of possible outcomes, all points 
in the unit interval. Sample spaces can also be discrete, as is the case 
of modeling a single flip of a “fair” coin with heads labeled “1” and 
tails labeled “0”, i.e., heads and tails are equally likely. The sample 
space in this example is $\Omega = \{0, 1\}$ and the probability for any event
or subset of $\Omega$ can be defined in a reasonable way by

$$P(F) = \sum_{r \in F} p(r), \tag{2.12}$$

or, equivalently,

$$P(F) = \sum 1_F(r) p(r), \tag{2.13}$$

where now $p(r) = 1/2$ for each $r \in \Omega$. The function $p$ is called a
probability mass function or pmf because it is summed over points
to find total probability, just as point masses are summed to find
total mass in physics. Be cautioned that $P$ is defined for sets and $p$
is defined only for points in the sample space. This can be confusing
when dealing with one-point or singleton sets, for example

$$P(\{0\}) = p(0), \qquad P(\{1\}) = p(1).$$



This may seem too much work for such a little example, but keep 
in mind that the goal is a formulation that will work for far more 
complicated and interesting examples. This example is different from 
the spinning wheel in that the sample space is discrete instead of 
continuous and that the probabilities of events are defined by sums 
instead of integrals, as one should expect when doing discrete math. 
It is easy to verify, however, that the basic properties (2.7)-(2.9) hold 
in this case as well (since sums behave like integrals), which in turn 
implies that the simple properties (a)-(d) also hold.
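
Since the coin-flip sample space has only four subsets, the basic properties can be verified by brute force. The sketch below does so with exact rational arithmetic. The helper names `P` and `all_events` are choices made here, and taking the event space to be all subsets of the sample space is an assumption appropriate only for this finite case.

```python
from fractions import Fraction
from itertools import chain, combinations

omega = frozenset({0, 1})
pmf = {0: Fraction(1, 2), 1: Fraction(1, 2)}  # fair coin: p(r) = 1/2

def P(event):
    """Probability of a set: sum the pmf over its points."""
    return sum((pmf[r] for r in event), Fraction(0))

def all_events(space):
    """Every subset of a finite sample space."""
    pts = list(space)
    subsets = chain.from_iterable(
        combinations(pts, k) for k in range(len(pts) + 1))
    return [frozenset(s) for s in subsets]

events = all_events(omega)

nonnegative = all(P(F) >= 0 for F in events)                      # (2.7)
normalized = P(omega) == 1                                        # (2.8)
additive = all(P(F | G) == P(F) + P(G)
               for F in events for G in events if not (F & G))    # (2.9)
complement_rule = all(P(omega - F) == 1 - P(F) for F in events)   # property (a)

print(nonnegative, normalized, additive, complement_rule)
```

Replacing the two entries of `pmf` by any pair of nonnegative fractions summing to one gives the same four results, which is the sense in which sums behave like integrals here.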



A Single Coin Flip as Signal Processing 

The coin flip example can also be derived in a very different way that 
provides our first example of signal processing. Consider again the 
spinning pointer so that the sample space is $\Omega$ and the probability
measure $P$ is described by (2.2) using a uniform pdf as in (2.4).
Performing the experiment by spinning the pointer will yield some
real number $r \in [0, 1)$. Define a measurement $q$ made on this outcome
by

$$q(r) = \begin{cases} 1 & \text{if } r \in [0, 0.5] \\ 0 & \text{if } r \in (0.5, 1) \end{cases}. \tag{2.14}$$



This function can also be defined somewhat more economically in 
terms of an indicator function as 



$$q(r) = 1_{[0, 0.5]}(r). \tag{2.15}$$

This is an example of a quantizer, an operation that maps a continuous
value into a discrete value. Quantization is an example of
signal processing since it is a function or mapping defined on an input
space, here $\Omega = [0, 1)$ or $\Omega = \Re$, producing a value in some output
space. In this example $\Omega_g = \{0, 1\}$. The dependence of a function
on its input space or domain of definition $\Omega$ and its output space
or range $\Omega_g$ is often denoted by $q : \Omega \to \Omega_g$. Although introduced as
an example of simple signal processing, the usual name for a real- 
valued function defined on the sample space of a probability space is 
a random variable. We shall see in the next chapter that there is an 
extra technical condition on functions to merit this name, but that 
is a detail that can be postponed. 

The output space $\Omega_g$ can be considered as a new sample space, the
space corresponding to the possible values seen by an observer of the
output of the quantizer (an observer who might not have access to the
original space). If we know both the probability measure on the input
space and the function, then in theory we should be able to describe
the probability measure that the output space inherits from the input
space. Since the output space is discrete, it should be described by
a pmf, say $p_q$. Since there are only two points, we need only find the
value of $p_q(1)$ (or $p_q(0)$ since $p_q(0) + p_q(1) = 1$). An output of 1 is
seen if and only if the input sample point lies in $[0, 0.5]$, so it follows
easily that $p_q(1) = P([0, 0.5]) = \int_0^{0.5} f(r)\,dr = 0.5$, exactly the value
assumed for the fair coin flip model. The pmf $p_q$ implies a probability
measure on the output space $\Omega_g$ by

$$P_q(F) = \sum_{r \in F} p_q(r),$$

where the subscript $q$ distinguishes the probability measure $P_q$ on
the output space from the probability measure $P$ on the input space.
Note that we can define any other binary quantizer corresponding to 
an “unfair” or biased coin by changing the 0.5 to some other value. 
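
The derived pmf can be computed exactly and also checked by simulation. In the sketch below the `threshold` parameter plays the role of the value 0.5 in the quantizer definition. The function names, the seed, the number of trials, and the use of `random.Random.random()` as the spinning pointer are assumptions made for illustration.

```python
import random

def quantizer(r, threshold=0.5):
    """Binary quantizer: 1 if r lies in [0, threshold], else 0."""
    return 1 if 0.0 <= r <= threshold else 0

def derived_pmf(threshold=0.5):
    """Exact output pmf when r is uniform on [0, 1).

    p_q(1) = P([0, threshold]) = threshold, the length of the interval.
    """
    return {1: threshold, 0: 1.0 - threshold}

def simulate(threshold, trials=100_000, seed=7):
    """Relative frequency of the output 1 over simulated spins."""
    rng = random.Random(seed)
    ones = sum(quantizer(rng.random(), threshold) for _ in range(trials))
    return ones / trials

fair = derived_pmf(0.5)
biased_estimate = simulate(0.3)
print(fair, biased_estimate)  # the estimate should be near 0.3
```

With `threshold=0.3` the same quantizer models a biased coin that shows heads with probability 0.3.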

This simple example makes several fundamental points that will 
evolve in depth in the course of this material. First, it provides 
an example of signal processing and the first example of a random
variable, which is essentially just a mapping of one sample space
into another. Second, it provides an example of a derived distribution:
given a probability space described by $\Omega$ and $P$ and a function
(random variable) $q$ defined on this space, we have derived a new
probability space describing the outputs of the function with sample
space $\Omega_g$ and probability measure $P_q$. Third, it is an example of a
common phenomenon that quite different models can result in identi- 
cal sample spaces and probability measures. Here the coin flip could 
be modeled in a directly given fashion by just describing the sam- 
ple space and the probability measure, or it can be modeled in an 
indirect fashion as a function (signal processing, random variable) 
on another experiment. This suggests, for example, that to study 
coin flips empirically we could either actually flip a fair coin, or we 
could spin a fair wheel and quantize the output. Although the sec- 
ond method seems more complicated, it is in fact extremely common 
since most random number generators (or pseudo-random number 
generators) strive to produce random numbers with a uniform dis- 
tribution on [0, 1) and all other probability measures are produced 
by further signal processing. We have seen how to do this for a sim- 
ple coin flip. In fact any pdf or pmf can be generated in this way. 
(See problem 3.7.)

The generation of uniform random numbers is both a science and an art.
Most generators function roughly as follows. One begins with a floating
point number in $(0, 1)$ called the seed, say $a$, and uses another positive
floating point number, say $b$, as a multiplier. A sequence $x_n$ is then
generated recursively as $x_0 = a$ and $x_n = b \times x_{n-1} \bmod (1)$ for
$n = 1, 2, \ldots$, that is, the fractional part of $b \times x_{n-1}$. If the two
numbers $a$ and $b$ are suitably chosen then $x_n$ should appear to be
uniform. (Try it!) In fact, since there are only a finite number (albeit
large) of possible numbers that can be represented on a digital computer,
this algorithm must eventually repeat and hence $x_n$ must be a periodic
sequence. As a result such a sequence of numbers is a pseudo-random
sequence and not a genuine sequence of random numbers. The goal of
designing a good pseudo-random number generator is to make the period
as long as possible and to make the sequences produced look as much as
possible like a random sequence in the sense that statistical tests for
independence are fooled. If one wanted to build a truly random generator,
one might use some natural phenomenon such as thermal noise, treated
near the end of the book: measure the voltage across a heated resistor
and let the random action of molecules in motion produce a random
measurement.
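
One way to “try it” is sketched below. The multiplier 16807 and the modulus $2^{31} - 1$ are assumptions: they are the widely used Park-Miller “minimal standard” constants, not values given in the text. The sketch also departs from a literal floating point recursion by keeping an integer state $s_n$ and reporting $x_n = s_n / M$, which makes the fractional-part recursion $x_n = b \times x_{n-1} \bmod (1)$ exact for seeds of the form $s_0 / M$.

```python
M = 2**31 - 1   # modulus (assumed; Park-Miller "minimal standard")
B = 16807       # multiplier (assumed; same source)

def lehmer(seed, count):
    """Yield count values x_n = frac(B * x_{n-1}), where x_n = s_n / M.

    The integer state s_n = B * s_{n-1} mod M is tracked exactly, so no
    floating point rounding enters the recursion itself.
    """
    s = seed % M
    if s == 0:
        raise ValueError("seed must not be a multiple of M")
    for _ in range(count):
        s = (B * s) % M
        yield s / M

xs = list(lehmer(seed=1, count=10_000))
mean = sum(xs) / len(xs)
print(xs[:3], mean)  # values in (0, 1); the mean should be near 0.5
```

Because the state takes at most $M - 1$ distinct nonzero values, the sequence must repeat eventually, which is the periodicity noted above.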

It may seem strange that the axioms of probability deal with apparently
abstract ideas of measures instead of corresponding physical
intuition. Physical intuition says that the probability tells you
something about the fraction of times specific events will occur in
a sequence of trials, such as the relative frequency of a pair of dice
summing to seven in a sequence of many rolls, or a decision algorithm
correctly detecting a single binary symbol in the presence of
noise in a transmitted data file. Such real world behavior can be
quantified by the idea of a relative frequency, that is, suppose the
output of the $n$th trial of a sequence of trials is $x_n$ and we wish to
know the relative frequency that $x_n$ takes on a particular value, say
$a$. Then given an infinite sequence of trials $x = \{x_0, x_1, x_2, \ldots\}$ we
could define the relative frequency of $a$ in $x$ by

$$r_a(x) = \lim_{n \to \infty} \frac{\text{number of } k \in \{0, 1, \ldots, n-1\} \text{ for which } x_k = a}{n}. \tag{2.16}$$

For example, the relative frequency of heads in an infinite sequence 
of fair coin flips should be 0.5, the relative frequency of rolling a pair 
of fair dice and having the sum be 7 in an infinite sequence of rolls 
should be 1/6 since the pairs (1, 6), (6, 1), (2, 5), (5, 2), (3, 4), (4, 3) are 
equally likely and form 6 of the possible 36 pairs of outcomes. Thus 
one might suspect that to make a rigorous theory of probability re- 
quires only a rigorous definition of probabilities as such limits and 
a reaping of the resulting benefits. In fact much of the history of 
theoretical probability consisted of attempts to accomplish this, but 
unfortunately it does not work. Such limits might not exist, or they 
might exist and not converge to the same thing for different repeti- 
tions of the same experiment. Even when the limits do exist there 
is no guarantee they will behave as intuition would suggest when 
one tries to do calculus with probabilities, that is, to compute prob- 
abilities of complicated events from those of simple related events. 
Attempts to get around these problems uniformly failed and proba- 
bility was not put on a rigorous basis until the axiomatic approach 
was completed by Kolmogorov. (A discussion of some of the contributions
of Kolmogorov may be found in the Kolmogorov memorial
issue of the Annals of Probability, Volume 17, 1989. His contributions
to information theory, a shared interest area of the authors, are
described in [11].) The axioms do, however, capture certain intuitive
aspects of relative frequencies. Relative frequencies are nonnegative, 
the relative frequency of the entire set of possible outcomes is one, 
and relative frequencies are additive in the sense that the relative
frequency of the symbol $a$ or the symbol $b$ occurring, $r_{a \cup b}(x)$, is clearly
$r_a(x) + r_b(x)$. Kolmogorov realized that beginning with simple axioms
could lead to rigorous limiting results of the type needed, while
there was no way to begin with the limiting results as part of the 
axioms. In fact it is the fourth axiom, a limiting version of additivity, 
that plays the key role in making the asymptotics work. 
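
The relative frequency of (2.16) can only be approximated in practice, by stopping at a finite number of trials. The following is a minimal sketch of that idea for the dice example, not part of the formal development; the function name `relative_frequency`, the seed, and the number of trials are illustrative assumptions.

```python
import random
from itertools import product

def relative_frequency(trials, a):
    """Finite-n version of (2.16): fraction of trials equal to a."""
    return sum(1 for x in trials if x == a) / len(trials)

# Exact value: 6 of the 36 equally likely ordered pairs sum to 7.
pairs = list(product(range(1, 7), repeat=2))
exact = sum(1 for d1, d2 in pairs if d1 + d2 == 7) / len(pairs)

# Simulated rolls of a pair of fair dice (seed and n are arbitrary choices).
rng = random.Random(0)
n = 200_000
sums = [rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n)]
estimate = relative_frequency(sums, 7)

print(exact, estimate)
```

The estimate is typically close to 1/6 but varies with the seed and with n, which is one way of seeing why a limit of such quantities is an awkward starting point for a definition.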



2.3 Probability Spaces 

We now turn to a more thorough development of the ideas introduced 
in the previous section. 

A sample space $\Omega$ is an abstract space, a nonempty collection of
points or members or elements called sample points (or elementary
events or elementary outcomes).

An event space (or sigma-field or sigma-algebra) $\mathcal{F}$ of a sample
space $\Omega$ is a nonempty collection of subsets of $\Omega$ called events with
the following properties:

$$ \text{If } F \in \mathcal{F}, \text{ then also } F^c \in \mathcal{F}, \tag{2.17} $$

that is, if a given set is an event, then its complement must also be an
event. Note that any particular subset of $\Omega$ may or may not be an event
(review the quantizer example).



If for some finite $n$, $F_i \in \mathcal{F}$, $i = 1, 2, \ldots, n$, then also

$$ \bigcup_{i=1}^{n} F_i \in \mathcal{F}, \tag{2.18} $$

that is, a finite union of events must also be an event.

If $F_i \in \mathcal{F}$, $i = 1, 2, \ldots$, then also

$$ \bigcup_{i=1}^{\infty} F_i \in \mathcal{F}, \tag{2.19} $$

that is, a countable union of events must also be an event.

We shall later see alternative ways of describing (2.19), but this 
form is the most common. 

Eq. (2.18) can be considered as a special case of (2.19) since, for
example, given a finite collection $F_i$, $i = 1, \ldots, N$, we can construct
an infinite sequence of sets $G_n$ with the same union
by choosing $G_n = F_n$ for $n = 1, 2, \ldots, N$ and $G_n = \emptyset$ otherwise. It
is convenient, however, to consider the finite case separately. If a 
collection of sets satisfies only (2.17) and (2.18) but not (2.19), then 
it is called a field or algebra of sets. For this reason, in elementary 
probability theory one often refers to “set algebra” or to the “algebra 
of events.” (Don’t worry about why (2.19) might not be satisfied.) 
Both (2.17) and (2.18) can be considered as “closure” properties; 
that is, an event space must be closed under complementation and 
unions in the sense that performing a sequence of complementations 
or unions of events must yield a set that is also in the collection, 
i.e., a set that is also an event. Observe also that (2.17), (2.18), and
(A.11) imply that

$$ \Omega \in \mathcal{F}, \tag{2.20} $$

that is, the whole sample space considered as a set must be in $\mathcal{F}$;
that is, it must be an event. Intuitively, $\Omega$ is the “certain event,” the
event that “something happens.” Similarly, (2.20) and (2.17) imply
that

$$ \emptyset \in \mathcal{F}, \tag{2.21} $$

and hence the empty set must be in $\mathcal{F}$, corresponding to the intuitive
event “nothing happens.”
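
For a finite sample space the closure properties can be checked mechanically. The sketch below is an illustration only, restricted to finite spaces where a countable union of events is always a finite union, so that (2.19) reduces to (2.18). It tests (2.17) and (2.18) for a given collection of subsets; the helper `is_event_space` and the sample collections are hypothetical names introduced here.

```python
from itertools import combinations

def is_event_space(omega, events):
    """Check (2.17) and (2.18) for a finite collection of subsets of omega."""
    omega = frozenset(omega)
    events = {frozenset(f) for f in events}
    if not events or any(not f <= omega for f in events):
        return False
    # (2.17): the complement of every event is an event.
    closed_under_complement = all(omega - f in events for f in events)
    # (2.18): pairwise unions suffice, by induction, for all finite unions.
    closed_under_union = all(f | g in events for f, g in combinations(events, 2))
    return closed_under_complement and closed_under_union

omega = {0, 1, 2, 3}
trivial = [set(), omega]
coarse = [set(), {0, 1}, {2, 3}, omega]
broken = [set(), {0}, omega]           # the complement {1, 2, 3} is missing

print(is_event_space(omega, trivial))  # True
print(is_event_space(omega, coarse))   # True
print(is_event_space(omega, broken))   # False
```

Every collection that passes the check contains both the whole space and the empty set, in agreement with (2.20) and (2.21).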

A few words about the different nature of membership in $\Omega$ and
$\mathcal{F}$ are in order. If the set $F$ is a subset of $\Omega$, then we write $F \subset \Omega$. If
the subset $F$ is also in the event space, then we write $F \in \mathcal{F}$. Thus
we use set inclusion when considering $F$ as a subset of an abstract
space, and element inclusion when considering $F$ as a member of
the event space and hence as an event. Alternatively, the elements
of $\Omega$ are points, and a collection of these points is a subset of $\Omega$;
but the elements of $\mathcal{F}$ are sets (subsets of $\Omega$) and not points.
A student should ponder the different natures of abstract spaces of
points and event spaces consisting of sets until the reasons for set
inclusion in the former and element inclusion in the latter space are
clear. Consider especially the difference between an element of $\Omega$
and a subset of $\Omega$ that consists of a single point. The latter might
or might not be an element of $\mathcal{F}$; the former is never an element of
$\mathcal{F}$. Although the difference might seem to be merely semantics, the
difference is important and should be thoroughly understood.

A measurable space $(\Omega, \mathcal{F})$ is a pair consisting of a sample space $\Omega$
and an event space or sigma-field $\mathcal{F}$ of subsets of $\Omega$. The strange name
“measurable space” reflects the fact that we can assign a measure,
such as a probability measure, to such a space and thereby form a
probability space or probability measure space.

A probability measure $P$ on a measurable space $(\Omega, \mathcal{F})$ is an assignment
of a real number $P(F)$ to every member $F$ of the sigma-field
(that is, to every event) such that $P$ obeys the following rules, which
we refer to as the axioms of probability.

Axiom 2.1

$$ P(F) \geq 0 \text{ for all } F \in \mathcal{F}, \tag{2.22} $$

i.e., no event has negative probability.

Axiom 2.2

$$ P(\Omega) = 1, \tag{2.23} $$

i.e., the probability of “everything” is one.

Axiom 2.3 If $F_i$, $i = 1, 2, \ldots, n$ are disjoint, then

$$ P\left( \bigcup_{i=1}^{n} F_i \right) = \sum_{i=1}^{n} P(F_i). \tag{2.24} $$

Axiom 2.4 If $F_i$, $i = 1, 2, \ldots$ are disjoint, then

$$ P\left( \bigcup_{i=1}^{\infty} F_i \right) = \sum_{i=1}^{\infty} P(F_i). \tag{2.25} $$

Note that nothing has been said to the effect that probabilities 
must be sums or integrals, but the first three axioms should be rec- 
ognizable from the three basic properties of nonnegativity, normaliza- 
tion, and additivity encountered in the simple examples introduced 
in the introduction to this chapter where probabilities were defined 
by an integral of a pdf over a set or a sum of a pmf over a set. The 
axioms capture these properties in a general form and will be seen to 
include more general constructions, including multidimensional in- 
tegrals and combinations of integrals and sums. The fourth axiom 
can be viewed as an extra technical condition that must be included 
in order to get various limits to behave. Just as property (2.19) of 
an event space will later be seen to have an alternative statement in 
terms of limits of sets, the fourth axiom of probability, axiom 2.4, 
will be shown to have an alternative form in terms of explicit limits, 
a form providing an important continuity property of probability. 
Also as in the event space properties, the fourth axiom implies the 
third. 

As with the defining properties of an event space, for the purposes 
of discussion we have listed separately the finite special case (2.24) 
of the general condition (2.25). The finite special case is all that is 
required for elementary discrete probability. The general condition 
is required to get a useful theory for continuous probability. A good 
way to think of these conditions is that they essentially describe 
probability measures as set functions defined by either summing or 
integrating over sets, or by some combination thereof. Hence much 
of probability theory is simply calculus, especially the evaluation of 
sums and integrals. 

To emphasize an important point: a function $P$ which assigns numbers
to elements of an event space of a sample space is a probability
measure if and only if it satisfies all of the four axioms!

A probability space or experiment is a triple $(\Omega, \mathcal{F}, P)$ consisting of
a sample space $\Omega$, an event space $\mathcal{F}$ of subsets of $\Omega$, and a probability
measure $P$ defined for all members of $\mathcal{F}$.

Before developing each idea in more detail and providing several 
examples of each piece of a probability space, we pause to consider 
two simple examples of the complete construction. The first example 
is the simplest possible probability space and is commonly referred to 
as the trivial probability space. Although useless for application, the
model does serve a purpose by showing that a well-defined
model need not be interesting. The second example is essentially the 
simplest nontrivial probability space, a slight generalization of the 
fair coin flip permitting an unfair coin. 



Examples 

[2.0] Let $\Omega$ be any abstract space and let $\mathcal{F} = \{\Omega, \emptyset\}$; that is, $\mathcal{F}$
consists of exactly two sets: the sample space (everything) and
the empty set (nothing). This is called the trivial event space.
This is a model of an experiment where only two events are possible:
“Something happens” or “nothing happens,” not a very interesting
description. There is only one possible probability measure
for this measurable space: $P(\Omega) = 1$ and $P(\emptyset) = 0$. (Why?)
This probability measure meets the required rules that define a
probability measure; they can be directly verified since there are
only two possible events. Equations (2.22) and (2.23) are obvious.
Equations (2.24) and (2.25) follow since the only possible
values for $F_i$ are $\Omega$ and $\emptyset$. At most one of the $F_i$ can be $\Omega$. If one
of the $F_i$ is $\Omega$, then both sides of the equality are 1. Otherwise,
both sides are 0.

[2.1] Let $\Omega = \{0, 1\}$. Let $\mathcal{F} = \{\{0\}, \{1\}, \Omega = \{0, 1\}, \emptyset\}$. Since $\mathcal{F}$
contains all of the subsets of $\Omega$, the properties (2.17) through
(2.19) are trivially satisfied, and hence it is an event space. (There
is one other possible event space that could be defined for $\Omega$ in
this example. What is it?) Define the set function $P$ by

$$ P(F) = \begin{cases} 1 - p & \text{if } F = \{0\} \\ p & \text{if } F = \{1\} \\ 0 & \text{if } F = \emptyset \\ 1 & \text{if } F = \Omega, \end{cases} $$

where $p \in [0, 1]$ is a fixed parameter. (If $p = 0$ or $p = 1$ the space
becomes trivial.) It is easily verified that $P$ satisfies the axioms
of probability and hence is a probability measure. Therefore
$(\Omega, \mathcal{F}, P)$ is a probability space. Note that we had to give the
value of $P(F)$ for all events $F$, a construction that would clearly
be absurd for large sample spaces. Note also that the choice of
$P(F)$ is not unique for the given measurable space $(\Omega, \mathcal{F})$; we
could have chosen any value in $[0, 1]$ for $P(\{1\})$ and used the
axioms to complete the definition.
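
The verification claimed in example [2.1] can be spelled out by brute force, since there are only four events. The sketch below is one possible way to do so, using exact rational arithmetic; the names `make_measure` and `satisfies_axioms` and the value p = 1/4 are assumptions made for illustration. Because the space is finite, checking additivity on disjoint pairs is enough for (2.24), and (2.25) adds nothing new.

```python
from fractions import Fraction
from itertools import combinations

def make_measure(p):
    """Return P as a dict from the events (frozensets) of {0, 1} to numbers."""
    return {
        frozenset(): Fraction(0),
        frozenset({0}): 1 - p,
        frozenset({1}): p,
        frozenset({0, 1}): Fraction(1),
    }

def satisfies_axioms(P, omega):
    nonnegative = all(value >= 0 for value in P.values())   # (2.22)
    normalized = P[frozenset(omega)] == 1                   # (2.23)
    additive = all(                                         # (2.24), disjoint pairs
        P[f | g] == P[f] + P[g]
        for f, g in combinations(P, 2)
        if not (f & g)
    )
    return nonnegative and normalized and additive

P = make_measure(Fraction(1, 4))
print(satisfies_axioms(P, {0, 1}))  # True
```

Changing a single value, for example setting the probability of {1} to 1/2 while leaving the others alone, breaks additivity and the check fails.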



The preceding example is the simplest nontrivial example of a 
probability space and provides a rigorous mathematical model for 
applications such as the binary transmission of a single bit or for 
the flipping of a single biased coin once. It therefore provides a 
complete and rigorous mathematical model for the single coin flip of 
the introduction. 

We now develop in more detail properties and examples of the 
three components of probability spaces: sample spaces, event spaces, 
and probability measures. 



2.3.1 Sample Spaces 

Intuitively, a sample space is a listing of all conceivable finest-grain, 
distinguishable outcomes of an experiment to be modeled by a prob- 
ability space. Mathematically it is just an abstract space. 

Examples 

[2.2] A finite space $\Omega = \{a_k; k = 1, 2, \ldots, K\}$. Specific examples
are the binary space $\{0, 1\}$ and the finite space of integers
$\mathcal{Z}_k = \{0, 1, 2, \ldots, k-1\}$.

[2.3] A countably infinite space $\Omega = \{a_k; k = 0, 1, 2, \ldots\}$, for some
sequence $\{a_k\}$. Specific examples are the space of all nonnegative
integers $\{0, 1, 2, \ldots\}$, which we denote by $\mathcal{Z}_+$, and the space of all
integers $\{\ldots, -2, -1, 0, 1, 2, \ldots\}$, which we denote by $\mathcal{Z}$. Other
examples are the space of all rational numbers, the space of all
even integers, and the space of all periodic sequences of integers.

Both examples [2.2] and [2.3] are discrete spaces, that is, spaces
with a finite or countably infinite number of elements.

[2.4] An interval of the real line $\Re$, for example, $\Omega = (a, b)$. We
might consider an open interval $(a, b)$, a closed interval $[a, b]$, a
half-open interval $[a, b)$ or $(a, b]$, or even the entire real line $\Re$
itself. (See appendix A for details on these different types of
intervals.)

Spaces such as example [2.4] that are not discrete are said to be 
continuous. In some cases it is more accurate to think of spaces 
as being a mixture of discrete and continuous parts, e.g., the space 
$\Omega = (1, 2) \cup \{4\}$ consisting of a continuous interval and an isolated
point. Such spaces can usually be handled by treating the discrete 
and continuous components separately. 

[2.5] A space consisting of $k$-dimensional vectors with coordinates
taking values in one of the previously described spaces. A useful
notation for such vector spaces is a product space. Let $A$ denote
one of the abstract spaces previously considered. Define the
Cartesian product $A^k$ by

$$ A^k = \{ \text{all vectors } a = (a_0, a_1, \ldots, a_{k-1}) \text{ with } a_i \in A \}. $$

Thus, for example, $\Re^k$ is $k$-dimensional Euclidean space. $\{0, 1\}^k$
is the space of all binary $k$-tuples, that is, the space of all
$k$-dimensional binary vectors. As particular examples,
$\{0, 1\}^2 = \{00, 01, 10, 11\}$ and $\{0, 1\}^3 = \{000, 001, 010, 011, 100, 101, 110, 111\}$.
$[0, 1]^2$ is the unit square in the plane. $[0, 1]^3$ is the unit cube in
three-dimensional Euclidean space.

Alternative notations for a Cartesian product space are

$$ \prod_{i \in \mathcal{Z}_k} A_i = \prod_{i=0}^{k-1} A_i, $$

where again the $A_i$ are all replicas or copies of $A$, that is, where
$A_i = A$, all $i$. Other notations for such a finite-dimensional Cartesian
product are

$$ \times_{i \in \mathcal{Z}_k} A_i = \times_{i=0}^{k-1} A_i = A^k. $$

This and other product spaces will prove to be a useful means of 
describing abstract spaces which model sequences of elements from 
another abstract space. 

Observe that a finite-dimensional vector space constructed from a 
discrete space is also discrete since if one can count the number of 
possible values one coordinate can assume, then one can count the 
number of possible values that a finite number of coordinates can 
assume. 
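
The counting argument can be made concrete by enumerating a small Cartesian product. This is a brief sketch using Python's standard `itertools.product`; the wrapper `cartesian_power` is a hypothetical helper, not notation from the text.

```python
from itertools import product

def cartesian_power(A, k):
    """All k-tuples (a_0, ..., a_{k-1}) with every coordinate in A."""
    return list(product(sorted(A), repeat=k))

binary_pairs = cartesian_power({0, 1}, 2)
binary_triples = cartesian_power({0, 1}, 3)

print(["".join(map(str, v)) for v in binary_pairs])
# ['00', '01', '10', '11']
print(len(binary_triples))  # 8, i.e. 2**3
```

In general a discrete space with m points yields m**k vectors of length k, so the product remains discrete.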

[2.6] A space consisting of infinite sequences drawn from one of
the examples [2.2] through [2.4]. Points in this space are often
called discrete time signals. This is also a product space. Let $A$
be a sample space and let $A_i$ be replicas or copies of $A$. We will
consider both one-sided and two-sided infinite products to model
sequences with and without a finite origin, respectively. Define
the two-sided space

$$ \prod_{i \in \mathcal{Z}} A_i = \{ \text{all sequences } \{a_i; i = \ldots, -1, 0, 1, \ldots\}; a_i \in A_i \}, $$

and the one-sided space

$$ \prod_{i \in \mathcal{Z}_+} A_i = \{ \text{all sequences } \{a_i; i = 0, 1, \ldots\}; a_i \in A_i \}. $$

These two spaces are also denoted by $\prod_{i=-\infty}^{\infty} A_i$ or $\times_{i=-\infty}^{\infty} A_i$ and
$\prod_{i=0}^{\infty} A_i$ or $\times_{i=0}^{\infty} A_i$, respectively.

The two spaces under discussion are often called sequence spaces.
Even if the original space $A$ is discrete, the sequence space constructed
from $A$ will be continuous. For example, suppose that
$A_i = \{0, 1, 2, 3, 4, 5, 6, 7, 8, 9\}$ for all integers $i$. Then $\times_{i=0}^{\infty} A_i$ is the
space of all semi-infinite (one-sided) decimal sequences, which is
equivalent to the space of all real numbers in the unit interval $[0, 1)$.
This follows since if $\omega \in \Omega$, then $\omega = (\omega_0, \omega_1, \omega_2, \ldots)$, which can be
written as $.\omega_0 \omega_1 \omega_2 \ldots$, which can represent any real number in the
unit interval by the decimal expansion $\sum_{i=0}^{\infty} \omega_i 10^{-i-1}$. This space
contains the decimal representations of all of the real numbers in the
unit interval, an uncountable infinity of numbers. Similarly, there
is an uncountable infinity of one-sided binary sequences because one
can express all points in the unit interval in the binary number system
as sequences to the right of the “decimal” point (problem A.11).
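
The correspondence between digit sequences and points of the unit interval can be illustrated with partial sums of the decimal expansion. The sketch below necessarily works with a finite prefix of a sequence, so it shows only approximations to the infinite sum; the function name `partial_expansion` is a hypothetical choice.

```python
from fractions import Fraction

def partial_expansion(digits):
    """Partial sum of sum_i w_i * 10**(-i-1) over the given leading digits."""
    return sum(Fraction(w, 10 ** (i + 1)) for i, w in enumerate(digits))

prefix = [2, 5, 0, 0]                      # leading digits of 2, 5, 0, 0, ...
print(partial_expansion(prefix))           # 1/4
print(float(partial_expansion([3] * 10)))  # close to 1/3
```

Each additional digit changes the partial sum by less than the previous one could, which is why the infinite expansion converges to a single point of the interval.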

[2.7] Let A be one of the sample spaces of examples [2.2] through 
[2.4]. Form a new abstract space consisting of all waveforms or 
functions of time with values in A, for example, all real-valued 
time functions or continuous time signals. This space is also 
modeled as a product space. For example, the infinite two-sided 
space for a given A is 

$$ \prod_{t \in \Re} A_t = \{ \text{all waveforms } \{x(t); t \in (-\infty, \infty)\}; x(t) \in A, \text{ all } t \}, $$

with a similar definition for one-sided spaces and for time func- 
tions on a finite time interval. 

Note that we indexed sequences (discrete time signals) using subscripts,
as in $x_n$, and we indexed waveforms (continuous time signals)
using parentheses, as in $x(t)$. In fact, the notations are interchangeable;
we could denote waveforms as $\{x(t); t \in \Re\}$ or as
$\{x_t; t \in \Re\}$. The notation using subscripts for sequences and parentheses
for waveforms is the most common, and we will usually stick
to it. Yet another notation for discrete time signals is $x[n]$, a common
notation in the digital signal processing literature. It is worth
remembering that vectors, sequences, and waveforms are all just indexed
collections of numbers; the only difference is the index set:
finite for vectors, countably infinite for sequences, and continuous
for waveforms.

* General Product Spaces

All of the product spaces we have described can be viewed as special 
cases of the general product space defined next. 

Let $\mathcal{T}$ be an index set such as a finite set of integers $\mathcal{Z}_k$, the set of
all integers $\mathcal{Z}$, the set of all nonnegative integers $\mathcal{Z}_+$, the real line $\Re$,
or the nonnegative reals $[0, \infty)$. Given a family of spaces $\{A_t; t \in \mathcal{T}\}$,
define the product space

$$ A^{\mathcal{T}} = \prod_{t \in \mathcal{T}} A_t = \{ \text{all } \{a_t; t \in \mathcal{T}\}; a_t \in A_t, \text{ all } t \}. $$

The notation $\times_{t \in \mathcal{T}} A_t$ is also used for the same thing. Thus product
spaces model spaces of vectors, sequences, and waveforms whose coordinate
values are drawn from some fixed space. This leads to two
notations for the space of all $k$-dimensional vectors with coordinates
in $A$: $A^k$ and $A^{\mathcal{Z}_k}$. The shorter and simpler notation $A^k$ is usually more
convenient.



2.3.2 Event Spaces 

Intuitively, an event space is a collection of subsets of the sample 
space or groupings of elementary events which we shall consider as 
physical events and to which we wish to assign probabilities. Math- 
ematically, an event space is a collection of subsets that is closed 
under certain set-theoretic operations; that is, performing certain 
operations on events or members of the event space must give other 
events. Thus, for example, if in the example of a single voltage measurement
we have $\Omega = \Re$ and we are told that the set of all voltages
greater than 5 volts, $\{\omega : \omega > 5\}$, is an event, that is, it is a member
of a sigma-field $\mathcal{F}$ of subsets of $\Re$, then necessarily its complement
$\{\omega : \omega \leq 5\}$ must also be an event, that is, a member of the sigma-field
$\mathcal{F}$. If the latter set is not in $\mathcal{F}$ then $\mathcal{F}$ cannot be an event space!
Observe that no problem arises if the complement physically cannot
happen; events that “cannot occur” can be included in $\mathcal{F}$ and then
assigned probability zero when choosing the probability measure $P$.
For example, even if you know that the voltage does not exceed 5
volts, if you have chosen the real line $\Re$ as your sample space, then
you must include the set $\{r : r > 5\}$ in the event space if the set
$\{r : r \leq 5\}$ is an event. The impossibility of a voltage greater than
5 is then expressed by assigning $P(\{r : r > 5\}) = 0$.

While the definition of a sigma-field requires only that the class be
closed under complementation and countable unions, these requirements
immediately yield additional closure properties. The countably
infinite version of DeMorgan’s “laws” of elementary set theory
(see appendix A) requires that if $F_i$, $i = 1, 2, \ldots$ are all members of a
sigma-field, then so is

$$ \bigcap_{i=1}^{\infty} F_i = \left( \bigcup_{i=1}^{\infty} F_i^c \right)^c. $$

It follows by similar set-theoretic arguments that any countable 
sequence of any of the set-theoretic operations (union, intersec- 
tion, complementation, difference, symmetric difference) performed 
on events must yield other events. Observe, however, that there 
is no guarantee that uncountable operations on events will produce 
new events; they may or may not. For example, if we are told that 
$\{F_r; r \in [0, 1]\}$ is a family of events, then it is not necessarily true
that $\bigcup_{r \in [0,1]} F_r$ is an event (see problem 2.3 for an example).

The requirement that a finite sequence of set-theoretic operations 
on events yields other events is an intuitive necessity and is easy to 
verify for a given collection of subsets of an abstract space: It is 
intuitively necessary that logical combinations (and, or, and not)
of events corresponding to physical phenomena should also be events 
to which a probability can be assigned. If you know the probability 
of a voltage being greater than zero and you know the probability 
that the voltage is not greater than 5 volts, then you should also be 
able to determine the probability that the voltage is greater than zero 
but not greater than 5 volts. It is easy to verify that finite sequences 
of set-theoretic combinations yield events because the finiteness of 
elementary set theory usually yields simple proofs. 
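
The voltage example can be written out as a short derivation. This is a sketch that uses only axioms 2.2 and 2.3; the symbol $V$ for the voltage and the event names $A$ and $B$ are introduced here for illustration. Let $A = \{V > 0\}$ and $B = \{V \leq 5\}$. Since $B$ and $B^c$ are disjoint with union $\Omega$, we have $P(B^c) = 1 - P(B)$. Since $B^c = \{V > 5\}$ is contained in $A$, the event $A$ splits into the disjoint pieces $A \cap B$ and $B^c$, so that

$$ P(A) = P(A \cap B) + P(B^c) = P(A \cap B) + 1 - P(B), $$

$$ P(\{0 < V \leq 5\}) = P(A \cap B) = P(A) + P(B) - 1. $$

For instance, if $P(A) = 0.9$ and $P(B) = 0.8$, then the probability that the voltage is greater than zero but not greater than 5 volts is $0.7$.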

A natural question arises in regard to (2.17) and (2.18): Why not 
try to construct a useful probability theory on the more general no- 
tion of a field rather than a sigma-field? The response is that it 
unfortunately does not work. Probability theory requires many results
involving limits, and such asymptotic results require the infinite
relations of (2.19) and (2.25). In some special cases, such as single
coin flipping or single die rolling, the simpler finite results suffice
because there are only a finite number of possible outcomes, and hence
limiting results become trivial: any finite field is automatically a
sigma-field. If, however, one flips a coin forever, then there is an
uncountable infinity of possible outcomes, and the asymptotic relations
become necessary. Let $\Omega$ be the space of all one-sided binary
sequences. Suppose that you consider the smallest field formed by 
all finite set-theoretic operations on the individual one-sided binary 
sequences, that is, on singleton sets in the sequence space. Then 
many countably infinite sets of binary sequences (say the set of all 
periodic sequences) are not events since they cannot be expressed as 
finite sequences of set-theoretic operations on the singleton sets. Obviously,
the sigma-field formed by including countable set-theoretic
operations does not have this defect. This is why sigma-fields must
be used rather than fields.



Limits of Sets 

The condition (2.19) can be related to a condition on limits by defining
the notion of a limit of a sequence of sets. This notion will
prove useful when interpreting the axioms of probability. Consider
a sequence $F_n$, $n = 1, 2, \ldots$, of sets with the property that each set
contains its predecessor, that is, that $F_{n-1} \subset F_n$ for all $n$. Such a
sequence of sets is said to be nested and increasing. For example,
the sequence $F_n = [1, 2 - 1/n)$ of subsets of the real line is increasing.
The sequence $(-n, a)$ is also increasing. Intuitively, the first example
increases to a limit of $[1, 2)$ in the sense that every point in the set
$[1, 2)$ is eventually included in one of the $F_n$. Similarly, the sequence
in the second example increases to $(-\infty, a)$. Formally, the limit of
an increasing sequence of sets can be defined as the union of all of
the sets in the sequence since the union contains all of the points in
all of the sets in the sequence and does not contain any points not
contained in at least one set (and hence an infinite number of sets)
in the sequence:

$$ \lim_{n \to \infty} F_n = \bigcup_{n=1}^{\infty} F_n. $$

Figure 2.2(a) illustrates such a sequence in a Venn diagram.





Figure 2.2. (a) Increasing sets, (b) decreasing sets 



Thus the limit of the sequence of sets $[1, 2 - 1/n)$ is indeed the set
$[1, 2)$, as desired, and the limit of $(-n, a)$ is $(-\infty, a)$. If $F$ is the limit
of a sequence of increasing sets $F_n$, then we write $F_n \uparrow F$.

Similarly, suppose that $F_n$; $n = 1, 2, \ldots$ is a decreasing sequence of
nested sets in the sense that $F_n \subset F_{n-1}$ for all $n$ as illustrated by the
Venn diagram in Figure 2.2(b). For example, the sequences of sets
$[1, 1 + 1/n)$ and $(1 - 1/n, 1 + 1/n)$ are decreasing. Again we have a
natural notion of the limit of this sequence: Both these sequences
of sets collapse to the point or singleton set $\{1\}$, the point in
common to all the sets. This suggests a formal definition based on
the countably infinite intersection of the sets.

Given a decreasing sequence of sets $F_n$; $n = 1, 2, \ldots$, we define the
limit of the sequence by

$$ \lim_{n \to \infty} F_n = \bigcap_{n=1}^{\infty} F_n, $$

that is, a point is in the limit of a decreasing sequence of sets if and
only if it is contained in all the sets of the sequence. If $F$ is the limit
of a sequence of decreasing sets $F_n$, then we write $F_n \downarrow F$.

Thus, given a sequence of increasing or decreasing sets, the limit 
of the sequence can be defined in a natural way: the union of the 
sets of the sequence or the intersection of the sets of the sequence, 
respectively. 
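
The two interval examples can be explored numerically by testing membership for individual points. The sketch below is only suggestive, since a program can examine finitely many sets of the sequence; the helper names and the search bound `n_max` are assumptions made for illustration. Exact fractions are used so that endpoint comparisons are not affected by rounding.

```python
from fractions import Fraction

def in_increasing(x, n):
    """Membership in F_n = [1, 2 - 1/n)."""
    return 1 <= x < 2 - Fraction(1, n)

def in_decreasing(x, n):
    """Membership in F_n = (1 - 1/n, 1 + 1/n)."""
    return 1 - Fraction(1, n) < x < 1 + Fraction(1, n)

def first_n_containing(x, member, n_max=10_000):
    """Smallest n <= n_max with x in F_n, or None if there is none."""
    return next((n for n in range(1, n_max + 1) if member(x, n)), None)

# A point of [1, 2) eventually enters the increasing sets; the point 2 never does.
print(first_n_containing(Fraction(19, 10), in_increasing))  # 11
print(first_n_containing(2, in_increasing))                 # None

# The point 1 lies in every decreasing set checked; 1.01 drops out at n = 100.
print(all(in_decreasing(1, n) for n in range(1, 1000)))                   # True
print(all(in_decreasing(Fraction(101, 100), n) for n in range(1, 1000)))  # False
```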

Say that we have a sigma-field $\mathcal{F}$ and an increasing sequence
$F_n$; $n = 1, 2, \ldots$ of sets in the sigma-field. Since the limit of the
sequence is defined as a union and since the union of a countable
number of events must be an event, then the limit must be an event.
For example, if we are told that the sets $[1, 2 - 1/n)$ are all events,
then the limit $[1, 2)$ must also be an event. If we are told that all
finite intervals of the form $(a, b)$, where $a$ and $b$ are finite, are events,
then the semi-infinite interval $(-\infty, b)$ must also be an event, since
it is the limit of the sequence of sets $(-n, b)$ as $n \to \infty$.

By a similar argument, if we are told that each set in a decreasing 
sequence F n is an event, then the limit must be an event, since it is 
an intersection of a countable number of events. Thus, for example, 
if we are told that all finite intervals of the form (a, b) are events, 
then the points or singleton sets must also be events, since a point
$\{a\}$ is the limit of the decreasing sequence of sets $(a - 1/n, a + 1/n)$.

If a class of sets is only a field rather than a sigma-field, that is, if 
it satisfies only (2.17) and (2.18), then there is no guarantee that the 
class will contain all limits of sets. Hence, for example, knowing that 
a class of sets contains all half-open intervals of the form $(a, b]$ for $a$
and b finite does not ensure that it will also contain points or singleton 
sets! In fact, it is straightforward to show that the collection of all 
such half-open intervals together with the complements of such sets 
and all finite unions of the intervals and complements forms a field. 
The singleton sets, however, are not in the field! (See problem 2.6.) 

Thus if we tried to construct a probability theory based on only
a field, we might have probabilities defined for events such as $(a, b)$
meaning “the output voltage of a measurement is between $a$ and $b$”
and yet not have probabilities defined for a singleton set $\{a\}$ meaning
“the output voltage is exactly $a$.” By requiring that the event space
be a sigma-field instead of only a field, we are assured that all such
limits are indeed events.

It is a straightforward exercise to show that given (2.17) and (2.18),
property (2.19) is equivalent to either of the following:

If $F_n \in \mathcal{F}$; $n = 1, 2, \ldots$, is a decreasing sequence or an increasing
sequence, then

$$ \lim_{n \to \infty} F_n \in \mathcal{F}. \tag{2.26} $$

We have already seen that (2.19) implies (2.26). Conversely, if
(2.26) is true and $G_n$ is an arbitrary sequence of events, then define
the increasing sequence

$$ F_n = \bigcup_{i=1}^{n} G_i. $$

Each $F_n$ is an event by (2.18). Obviously $F_{n-1} \subset F_n$, and then (2.26)
implies (2.19), since

$$ \bigcup_{i=1}^{\infty} G_i = \bigcup_{n=1}^{\infty} F_n = \lim_{n \to \infty} F_n \in \mathcal{F}. $$
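
The construction of an increasing sequence from an arbitrary one can be illustrated with a handful of finite sets. This is a minimal sketch in which the list `G` is a hypothetical stand-in for the first few events of a sequence; it builds the running unions and confirms that they are nested and have the same overall union.

```python
from itertools import accumulate

# Hypothetical events G_1, ..., G_5, given as finite sets for illustration.
G = [{3}, {1, 2}, {2, 5}, set(), {4}]

# F_n = G_1 ∪ ... ∪ G_n, built as a running union.
F = list(accumulate(G, lambda acc, g: acc | g))

nested = all(F[n - 1] <= F[n] for n in range(1, len(F)))
same_union = set().union(*G) == F[-1]
print(nested, same_union)  # True True
```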

Examples 

As we have noted, for a given sample space the selection of an event 
space is not unique; it depends on the events to which it is desired to 
assign probabilities and also on analytical limitations on the ability 
to assign probabilities. We begin with two examples that represent 
the extremes of event spaces — one possessing the minimum quantity 
of sets and the other possessing the maximum. We then study event 
spaces useful for the sample space examples of the preceding section. 

[2.8] Given a sample space $\Omega$, the collection $\{\Omega, \emptyset\}$ is a sigma-field.
This is just the trivial event space already treated in example
[2.0]. Observe again that this is the smallest possible event
space for any given sample space because no other event space
can have fewer elements.

[2.9] Given a sample space $\Omega$, the collection of all subsets of
$\Omega$ is a sigma-field. This is true since any countable sequence of
set-theoretic operations on subsets of $\Omega$ must yield another subset
of $\Omega$ and hence must be in the collection of all possible subsets.
The collection of all subsets of a space is called the power set of
the space. Observe that this is the largest possible event space for
the given sample space, because it contains every possible subset
of the sample space.

This sigma-field is a useful event space for the sample spaces of
examples [2.2] and [2.3], that is, for sample spaces that are discrete.
We shall always take our event space as the power set when
dealing with a discrete sample space (except possibly for a few perverse
homework problems). A discrete sample space with $n$ elements
has a power set with $2^n$ elements (problem 2.5). For example, the
power set of the binary sample space $\Omega = \{0, 1\}$ is the collection
$\{\{0\}, \{1\}, \Omega = \{0, 1\}, \emptyset\}$, a list of all four possible subsets of the space.
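
The count of $2^n$ subsets is easy to confirm by enumeration for small spaces. The following sketch builds the power set of a finite sample space with the standard `itertools` module; the function name `power_set` is a hypothetical helper.

```python
from itertools import chain, combinations

def power_set(omega):
    """All subsets of a finite sample space, as frozensets."""
    points = list(omega)
    subsets = chain.from_iterable(
        combinations(points, r) for r in range(len(points) + 1)
    )
    return {frozenset(s) for s in subsets}

events = power_set({0, 1})
print(len(events))               # 4
print(len(power_set(range(5))))  # 32
```

The size doubles with every added sample point, which is why listing a probability for each event individually quickly becomes impractical.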

Unfortunately, the power set is too large to be useful for contin- 
uous spaces. To treat the reasons for this is beyond the scope of a 
book at this level, but we can say that it is not possible in general 
to construct interesting probability measures on the power set of a 
continuous space. There are special cases where we can construct 
particular probability measures on the power set of a continuous 
space by mimicking the construction for a discrete space (see, e.g., 
problems 2.5, 2.7, and 2.10). Truly continuous experiments cannot, 
however, be rigorously defined for such a large event space because 
integrals cannot be defined over all events in such spaces. 

While both of the preceding examples can be used to provide event
spaces for the special case of $\Omega = \Re$, the real line, neither leads to
a useful probability theory in that case. In the next example we
consider another event space for the real line that is more useful and,
in fact, is used almost always for $\Re$ and higher dimensional Euclidean
spaces. First, however, we need to treat the idea of generating an
event space from a collection of important events. Intuitively, given a
collection of important sets $\mathcal{G}$ that we require to be events, the event
space $\sigma(\mathcal{G})$ generated by $\mathcal{G}$ is the smallest event space $\mathcal{F}$ to which
all the sets in $\mathcal{G}$ belong. That is, $\sigma(\mathcal{G})$ is an event space, it contains
all the sets in $\mathcal{G}$, and no smaller collection of sets satisfies these two
conditions.
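
For a finite sample space the generated event space can be computed directly by repeatedly adding complements and unions until nothing new appears. The sketch below illustrates that idea under the assumption of a finite space, where countable unions reduce to finite ones; it is not the construction used for the real line, and the name `generated_event_space` is hypothetical.

```python
def generated_event_space(omega, G):
    """Smallest collection containing G that is closed under
    complement and union (finite omega only)."""
    omega = frozenset(omega)
    # Any event space contains omega, so it is safe to start with it.
    events = {frozenset(g) for g in G} | {omega}
    while True:
        new = {omega - f for f in events}
        new |= {f | g for f in events for g in events}
        if new <= events:       # nothing new: the collection is closed
            return events
        events |= new

sigma = generated_event_space({0, 1, 2, 3}, [{0}, {0, 1}])
print(sorted(sorted(f) for f in sigma))
print(len(sigma))  # 8
```

Here the generating sets cannot separate the points 2 and 3, so the result has 8 events rather than the 16 of the full power set.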

Regardless of the details, it is worth emphasizing the key points of
this discussion.

• The notion of a generated sigma-field allows one to describe an event
space for the real line, called the Borel field, that contains all physically 
important events and which will lead to a useful calculus of probability. 
It is usually not important to understand the detailed structure of this 
event space past the facts that it 

— is indeed an event space, and 

— it contains all the important events such as intervals. 

• The notion of a generated sigma-field can be used to extend the event 
space of the real line to event spaces of vectors, sequences, and wave- 
forms taking on real values. Again the detailed structure is usually not 
important past the fact that it 

— is indeed an event space, and 

— it contains all the important events such as those described by requiring 
any finite collection of coordinate values to lie within intervals. 

* Generating Event Spaces 

Any useful event space for the real line should include as members all 
intervals of the form (a, b) since we certainly wish to consider events 
of the form “the output voltage is between 3 and 5 volts.” Further- 
more, we obviously require that the event space satisfy the defining 
properties for an event space, that is, that we have a collection of 
subsets of $\Omega$ that satisfy properties (2.17) through (2.19). A means
of accomplishing both of these goals in a relatively simple fashion is
to define our event space as the smallest sigma-field that contains
the desired subsets, to wit, the intervals and all of their countable
set-theoretic combinations (bewildering as it may seem, this is not
the same as all subsets of $\Re$). Of course, although a sigma-field that
is based on the intervals is most useful, it is also possible to consider 
other starting points. These considerations motivate the following 
general definition. 

Given a sample space $\Omega$ (such as the real line $\Re$) and an arbitrary
class $\mathcal{G}$ of subsets of $\Omega$ (usually the class of all open intervals of the
form $(a, b)$ when $\Omega = \Re$), define $\sigma(\mathcal{G})$, the sigma-field generated by
the class $\mathcal{G}$, to be the smallest sigma-field containing all of the sets in
$\mathcal{G}$, where by “smallest” we mean that if $\mathcal{F}$ is any sigma-field and it
contains $\mathcal{G}$, then it contains $\sigma(\mathcal{G})$. (See any book on measure theory,
e.g., Ash [1].) 
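
On a finite sample space the generated sigma-field can be computed by brute force, which may help make the definition concrete. The following is only a sketch under that finiteness assumption (so that countable unions reduce to finite unions); the function name `generated_sigma_field` is hypothetical and not part of the text.

```python
from itertools import combinations

def generated_sigma_field(omega, generators):
    """Smallest collection of subsets of the finite set `omega` that contains
    every set in `generators` and is closed under complement and union."""
    omega = frozenset(omega)
    field = {frozenset(), omega} | {frozenset(g) for g in generators}
    changed = True
    while changed:                      # repeat until no new set appears
        changed = False
        current = list(field)
        for a in current:               # close under complementation
            complement = omega - a
            if complement not in field:
                field.add(complement)
                changed = True
        for a, b in combinations(current, 2):   # close under pairwise union
            union = a | b
            if union not in field:
                field.add(union)
                changed = True
    return field

# Generators {0} and {1} on a four-point space: the points 2 and 3 are never
# separated, so the result is strictly smaller than the power set.
sigma = generated_sigma_field({0, 1, 2, 3}, [{0}, {1}])
```

Here `sigma` has 8 members rather than the 16 of the power set: it contains `{2, 3}` but not `{2}`, a small-scale analogue of the remark that the sigma-field generated by the intervals need not contain every subset.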
For example, as noted before, we might require that a sigma-field
of the real line contain all intervals; then it would also have to con- 
tain at least all complements of intervals and all countable unions 
and intersections of intervals and all countable complements, unions, 
and intersections of these results, ad infinitum. This technique will 
be used several times to specify useful event spaces in complicated 
situations such as continuous sample spaces, sequence spaces, and
function spaces. We are now ready to provide the proper, most use- 
ful event space for the real line. 

[2.10] Given the real line $\Re$, the Borel field (or, more accurately,
the Borel sigma-field) is defined as the sigma-field generated by
all the open intervals of the form $(a, b)$. The members of the
Borel field are called Borel sets. We shall denote the Borel field
by $\mathcal{B}(\Re)$, and hence

$$\mathcal{B}(\Re) = \sigma(\text{all open intervals}).$$

Since $\mathcal{B}(\Re)$ is a sigma-field and since it contains all of the open
intervals, it must also contain limit sets of the form

$$(-\infty, b) = \lim_{n \to \infty} (-n, b),$$

$$(a, \infty) = \lim_{n \to \infty} (a, n),$$

and

$$\{a\} = \lim_{n \to \infty} (a - 1/n, a + 1/n),$$

that is, the Borel field must include semi-infinite open intervals and
the singleton sets or individual points. Furthermore, since the Borel
field is a sigma-field it must contain differences. Hence it must contain
semi-infinite half-open sets of the form

$$(-\infty, b] = (-\infty, \infty) - (b, \infty),$$

and since it must contain unions of its members, it must contain
half-open intervals of the form

$$(a, b] = (a, b) \cup \{b\} \quad \text{and} \quad [a, b) = (a, b) \cup \{a\}.$$

In addition, it must contain all closed intervals and all finite or countable
unions and complements of intervals of any of the preceding
forms. Roughly speaking, the Borel field contains all subsets of the 
real line that can be constructed by countable sequences of opera- 
tions on intervals. It is a deep and difficult result of measure theory 
that the Borel field of the real line is in fact different from the power 
set of the real line; that is, there exist subsets of the real line that are 
not in the Borel field. While we will not describe such a subset, we 
can guarantee that these “unmeasurable” sets have no physical im- 
portance, that they are very hard to construct, and that an engineer 
will never encounter such a subset in practice. It may, however, be 
necessary to demonstrate that some weird subset is in fact an event 
in this sigma-field. This is typically accomplished by showing that it
is the limit of simple Borel sets. 
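
As one illustration of showing that a set is the limit of simple Borel sets, a closed interval with $a \le b$ can be written as the limit of a decreasing sequence of open intervals. This is a standard construction offered as a worked example, not a passage from the text:

```latex
[a, b] \;=\; \bigcap_{n=1}^{\infty} \left(a - \tfrac{1}{n},\; b + \tfrac{1}{n}\right)
       \;=\; \lim_{n \to \infty} \left(a - \tfrac{1}{n},\; b + \tfrac{1}{n}\right)
```

Every point of $[a, b]$ lies in each of the open intervals, while any point outside $[a, b]$ is excluded once $1/n$ is smaller than its distance from the interval, so the countable intersection is exactly $[a, b]$.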

In some cases we wish to deal not with a sample space that is the 
entire real line, but one that is some subset of the real line. In this 
case we define the Borel field as the Borel field of the real line “cut 
down” to the smaller space. 

Given that the sample space, $\Omega$, is a Borel subset of the real line
$\Re$, the Borel field of $\Omega$, denoted $\mathcal{B}(\Omega)$, is defined as the collection of
all sets of the form $F \cap \Omega$ for $F \in \mathcal{B}(\Re)$; that is, the intersection of
$\Omega$ with all of the Borel sets of $\Re$ forms the class of Borel sets of $\Omega$.

It can be shown (problem 2.4) that, given a discrete subset A of 
the real line, the Borel field $\mathcal{B}(A)$ is identical to the power set of
A. Thus, for the first three examples of sample spaces, the Borel 
field serves as a useful event space since it reduces to the intuitively 
appealing class of all subsets of the sample space. 

The remaining examples of sample spaces are all product spaces. 
The construction of event spaces for such product spaces — that 
is, spaces of vectors, sequences, or waveforms — is more compli- 
cated and less intuitive than the constructions for the preceding event 
spaces. In fact, there are several possible techniques of construction, 
which in some cases lead to different event spaces. We wish to convey 
an understanding of the structure of such event spaces, but we do not 
wish to dwell on the technical difficulties that can be encountered. 
Hence we shall study only one of the possible constructions — the 
simplest possible definition of a product sigma-field — by making a 
direct analogy to a product sample space. This definition will suffice
for most systems studied herein, but it has shortcomings. At
this time we mention one particular weakness: The event space that 
we shall define may not be big enough when studying the theory of 
continuous time random processes. 

[2.11] Given an abstract space $A$, a sigma-field $\mathcal{F}$ of subsets of $A$,
an index set $\mathcal{I}$, and a product sample space of the form

$$A^{\mathcal{I}} = \prod_{t \in \mathcal{I}} A_t,$$

where the $A_t$ are all replicas of $A$, the product sigma-field

$$\mathcal{F}^{\mathcal{I}} = \prod_{t \in \mathcal{I}} \mathcal{F}_t$$

is defined as the sigma-field generated by all “one-dimensional”
sets of the form

$$\{\{a_t; t \in \mathcal{I}\} : a_t \in F \text{ for } t = s \text{ and } a_t \in A_t \text{ for } t \neq s\}$$

for some $s \in \mathcal{I}$ and some $F \in \mathcal{F}$; that is, the product sigma-field is
the sigma-field generated by all “one-dimensional” events formed
by collecting all of the vectors or sequences or waveforms with
one coordinate constrained to lie in a one-dimensional event and
with the other coordinates unrestricted. The product sigma-field
must contain all such events; that is, for all possible indices $s$ and
all possible events $F$.

Thus, for example, given the one-dimensional abstract space $\Re$,
the real line along with its Borel field, Figure 2.3 (a)-(c) depicts
three examples of one-dimensional sets in $\Re^2$, the two-dimensional
Euclidean plane. Note, for example, that the unit circle
$\{(x, y) : x^2 + y^2 \le 1\}$ is not a one-dimensional set since it requires
simultaneous constraints on two coordinates.

More generally, for a fixed finite $k$ the product sigma-field $\mathcal{B}(\Re)^{\mathcal{Z}_k}$
(or simply $\mathcal{B}(\Re)^k$) of $k$-dimensional Euclidean space $\Re^k$ is the smallest
sigma-field containing all one-dimensional events of the form

$$\{x = (x_0, x_1, \ldots, x_{k-1}) : x_i \in F\}$$

for some $i = 0, 1, \ldots, k - 1$ and some Borel set $F$ of $\Re$. The two-dimensional
example Figure 2.3(a) has this form with $k = 2$, $i = 0$,
and $F = (1, 3)$. This one-dimensional set consists of all values in the
infinite rectangle between 1 and 3 in the $x_0$ direction and between
$-\infty$ and $\infty$ in the $x_1$ direction.

Figure 2.3. One- and two-dimensional events in two-dimensional space.
(a) Upper left: $\{(x_0, x_1) : x_0 \in (1, 3)\}$.
(b) Upper right: $\{(x_0, x_1) : x_1 \in (3, 6)\}$.
(c) Lower left: $\{(x_0, x_1) : x_1 \in (4, 5) \cup (-\infty, -2)\}$.
(d) Lower right: $\{(x_0, x_1) : x_0 \in (1, 3); x_1 \in (3, 6)\}$.

To summarize, we have defined a space $A$ with event space $\mathcal{F}$, and
an index set $\mathcal{I}$ such as $\Re$ or $[0, 1)$, and we have formed the
product space $A^{\mathcal{I}}$ and the associated product event space $\mathcal{F}^{\mathcal{I}}$. We
know that this event space contains all one-dimensional events by
construction. We next consider what other events must be in $\mathcal{F}^{\mathcal{I}}$ by
virtue of its being an event space.
After the one-dimensional events that pin down the value of a single
coordinate of the vector or sequence or waveform, the next most
general kinds of events are finite-dimensional sets that separately pin 
down the values at a finite number of coordinates. Let $\mathcal{K}$ be a finite
collection of members of $\mathcal{I}$ and hence $\mathcal{K} \subset \mathcal{I}$. Say that $\mathcal{K}$ has $K$
members, which we shall denote as $\{k_i; i = 0, 1, \ldots, K - 1\}$. These
$K$ numbers can be thought of as a collection of sample times such
as {1, 4, 8, 156, 1027} for a sequence or {1.5, 9.07, 40.0, 41.2, 41.3} for
a waveform. We assume for convenience that the sample times are
ordered in increasing fashion. Let $\{F_{k_i}; i = 0, 1, \ldots, K - 1\}$ be a
collection of members of $\mathcal{F}$. Then a set of the form

$$\{\{x_t; t \in \mathcal{I}\} : x_{k_i} \in F_{k_i}; i = 0, 1, \ldots, K - 1\}$$

is an example of a finite-dimensional set. Note that it collects all 
sequences or waveforms such that a finite number of coordinates are 
constrained to lie in one-dimensional events. An example of two- 
dimensional sets of this form in two-dimensional space is illustrated 
in Figure 2.3(d). Observe that when the one-dimensional sets con- 
straining the coordinates are intervals, then the two-dimensional sets 
are rectangles. Analogous to the two-dimensional example, finite- 
dimensional events having separate constraints on each coordinate 
are called rectangles. Observe, for example, that a circle or sphere in 
Euclidean space is not a rectangle because it cannot be defined using 
separate constraints on the coordinates; the constraints on each coordinate
depend on the values of the others; e.g., in two dimensions
for a circle of radius one we require that $x_0^2 \le 1 - x_1^2$.

Note that Figure 2.3(d) is just the intersection of examples (a) 
and (b) of Figure 2.3. In general we can express finite-dimensional 
rectangles as intersections of one-dimensional events as 

$$\{\{x_t; t \in \mathcal{I}\} : x_{k_i} \in F_{k_i}; i = 0, 1, \ldots, K - 1\} = \bigcap_{i=0}^{K-1} \{\{x_t; t \in \mathcal{I}\} : x_{k_i} \in F_{k_i}\},$$

that is, a set constraining a finite number of coordinates to each
lie in one-dimensional events or sets in $\mathcal{F}$ is the intersection of a
collection of one-dimensional events. Since $\mathcal{F}^{\mathcal{I}}$ is a sigma-field and
since it contains the one-dimensional events, it must contain such
finite intersections, and hence it must contain such finite-dimensional 
events. 
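
The statement that a rectangle is an intersection of one-dimensional events can be mimicked with indicator functions. This is a minimal sketch: the helper names `one_dimensional_event` and `intersection` are hypothetical, and the intervals are the ones from panels (a), (b), and (d) of Figure 2.3.

```python
def one_dimensional_event(index, low, high):
    """Indicator of {x : low < x[index] < high}; all other coordinates
    of the vector x are left unrestricted."""
    return lambda x: low < x[index] < high

def intersection(*events):
    """Indicator of the intersection of finitely many events."""
    return lambda x: all(event(x) for event in events)

a = one_dimensional_event(0, 1, 3)   # x_0 in (1, 3)
b = one_dimensional_event(1, 3, 6)   # x_1 in (3, 6)
rectangle = intersection(a, b)       # x_0 in (1, 3) and x_1 in (3, 6)

inside = rectangle((2, 4))           # both constraints hold
outside = rectangle((2, 7))          # x_1 violates its constraint
```

Note that `a((2, 100))` is still true, since a one-dimensional event places no restriction on the other coordinate.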

By concentrating on events that can be represented as the finite 
intersection of one-dimensional events we do not mean to imply that 
all events in the product event space can be represented in this fash- 
ion — the event space will also contain all possible limits of finite 
unions of such rectangles, complements of such sets, and so on. For 
example, the unit circle in two dimensions is not a rectangle, but it 
can be considered as a limit of unions of rectangles and hence is in 
the event space generated by the rectangles. (See problem 2.36.) 

The moral of this discussion is that the product sigma-field for 
spaces of sequences and waveforms must contain (but not consist 
exclusively of) all sets that are described by requiring that a finite
number of coordinates lie in sets in the one-dimensional event
space $\mathcal{F}$.

We shall further explore such product event spaces when consid- 
ering random processes, but the key points remain 

1. a product event space is a sigma-field, and

2. it contains all “one-dimensional events” consisting of subsets of the prod- 
uct sample space formed by grouping together all vectors or sequences 
or waveforms having a single fixed coordinate lying in a one-dimensional 
event. In addition, it contains all rectangles or finite-dimensional events 
consisting of all vectors or sequences or waveforms having a finite number 
of coordinates constrained to lie in one-dimensional events. 



2.3.3 Probability Measures 

The defining axioms of a probability measure as given in equations 
(2.22) through (2.25) correspond generally to intuitive notions, at 
least for the first three properties. The first property requires that 
a probability be a nonnegative number. In a purely mathematical 
sense, this is an arbitrary restriction, but it is in accord with the long 
history of intuitive and combinatorial developments of probability. 
Probability measures share this property with other measures such 
as area, volume, weight, and mass. 
The second defining property corresponds to the notion that the
probability that something will happen or that an experiment will 
produce one of its possible outcomes is one. This, too, is mathe- 
matically arbitrary but is a convenient and historical assumption. 
(From childhood we learn about things that are “100% certain”;
obviously we could as easily take 100 or $\pi$ (but not infinity: why?)
to represent certainty.)

The third property, “additivity” or “finite additivity,” is the key 
one. In English it reads that the probability of occurrence of a finite 
collection of events having no points in common must be the sum 
of the probabilities of the separate events. More generally, the basic 
assumption of measure theory is that any measure — probabilistic 
or not — such as weight, volume, mass, and area should be additive: 
the mass of a group of disjoint regions of matter should be the sum 
of the separate masses; the weight of a group of objects should be the 
sum of the individual weights. Equation (2.24) only pins down this 
property for finite collections of events. The additional restriction 
of (2.25), called countable additivity , is a limiting or asymptotic or 
infinite version, analogous to (2.19) for set algebra. This again leads
to the rhetorical question of why the more complicated, more restrictive,
and less intuitive infinite version is required. In fact, it was
the addition of this limiting property that provided the fundamental 
idea for Kolmogorov’s development of modern probability theory in 
the 1930s. 

The response to the rhetorical question is essentially the same 
as that for the asymptotic set algebra property: Countably infinite 
properties are required to handle asymptotic and limiting results. 
Such results are crucial because we often need to evaluate the proba- 
bilities of complicated events that can only be represented as a limit 
of simple events. (This is analogous to the way that integrals are 
obtained as limits of finite sums.) 

Note that it is countable additivity that is required. Uncountable 
additivity cannot be defined sensibly. This is easily seen in terms 
of the fair wheel mentioned at the beginning of the chapter. If the 
wheel is spun, any particular number has probability zero. On the 
other hand, the probability of the event made up of all of the uncountable
numbers between 0 and 1 is obviously one. If you consider
defining the probability of all the numbers between 0 and 1 to be the 
uncountable sum of the individual probabilities, you see immediately 
the essential contradiction that results. 

Since countable additivity has been added to the axioms proposed 
in the introduction, the formula (2.11) used to compute probabilities
of events broken up by a partition immediately extends to partitions
with a countable number of elements; that is, if $F_k$; $k = 1, 2, \ldots$
forms a partition of $\Omega$ into disjoint events ($F_n \cap F_k = \emptyset$ if $n \neq k$ and
$\bigcup_k F_k = \Omega$), then for any event $G$

$$P(G) = \sum_{k=1}^{\infty} P(G \cap F_k). \qquad (2.27)$$



Limits of Probabilities 

At times we are interested in finding the probability of the limit of 
a sequence of events. To relate the countable additivity property 
of (2.25) to limiting properties, recall the discussion of the limiting 
properties of events given earlier in this chapter in terms of increasing
and decreasing sequences of events. Say we have an increasing
sequence of events $F_n$; $n = 0, 1, 2, \ldots$, $F_{n-1} \subset F_n$, and let $F$ denote
the limit set, that is, the union of all of the $F_n$. We have already
argued that the limit set $F$ is itself an event. Intuitively, since the
$F_n$ converge to $F$, the probabilities of the $F_n$ should converge to the
probability of $F$. Such convergence is called a continuity property of
probability and is very useful for evaluating the probabilities of complicated
events as the limit of a sequence of probabilities of simpler
events. We shall show that countable additivity implies such continuity.
To accomplish this, define the sequence of sets $G_0 = F_0$ and
$G_n = F_n - F_{n-1}$ for $n = 1, 2, \ldots$. The $G_n$ are disjoint and have the
same union as do the $F_n$ (see Figure 2.2(a) as a visual aid). Thus we
have from countable additivity that



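$$P\left(\lim_{n \to \infty} F_n\right) = P\left(\bigcup_{k=0}^{\infty} F_k\right) = P\left(\bigcup_{k=0}^{\infty} G_k\right) = \sum_{k=0}^{\infty} P(G_k) = \lim_{n \to \infty} \sum_{k=0}^{n} P(G_k),$$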



where the last step simply uses the definition of an infinite sum.
Since $G_n = F_n - F_{n-1}$ and $F_{n-1} \subset F_n$, $P(G_n) = P(F_n) - P(F_{n-1})$
and hence

$$\sum_{k=0}^{n} P(G_k) = P(F_0) + \sum_{k=1}^{n} \left(P(F_k) - P(F_{k-1})\right) = P(F_n).$$



This is an example of what is called a “telescoping sum” where each 
term cancels the previous term and adds a new piece, i.e., 



$$\begin{aligned}
P(F_n) = {} & P(F_n) - P(F_{n-1}) \\
& + P(F_{n-1}) - P(F_{n-2}) \\
& + P(F_{n-2}) - P(F_{n-3}) \\
& \;\; \vdots \\
& + P(F_1) - P(F_0) \\
& + P(F_0).
\end{aligned}$$

Combining these results completes the proof of the following state- 
ment. 

If $F_n$ is a sequence of increasing events, then

$$P\left(\lim_{n \to \infty} F_n\right) = \lim_{n \to \infty} P(F_n), \qquad (2.28)$$

that is, the probability of the limit of a sequence of increasing 
events is the limit of the probabilities. 

Note that the sequence of probabilities on the right-hand side of 
(2.28) is increasing with increasing $n$. Thus, for example, probabilities
of semi-infinite intervals can be found as a limit as
$P((-\infty, a]) = \lim_{n \to \infty} P((-n, a])$. A similar argument can be used to show that one
can also interchange the limit with the probability measure given a
sequence of decreasing events; that is, 

If $F_n$ is a sequence of decreasing events, then

$$P\left(\lim_{n \to \infty} F_n\right) = \lim_{n \to \infty} P(F_n), \qquad (2.29)$$

that is, the probability of the limit of a sequence of decreasing 
events is the limit of the probabilities. 

Note that the sequence of probabilities on the right-hand side of 
(2.29) is decreasing with increasing $n$. Thus, for example, the probabilities
of points can be found as a limit of probabilities of intervals,
$P(\{a\}) = \lim_{n \to \infty} P((a - 1/n, a + 1/n))$.

It can be shown (see problem 2.21) that, given (2.22) through 
(2.24), the three conditions (2.25), (2.28), and (2.29) are equivalent; 
that is, any of the three could serve as the fourth axiom of probability. 

Property (2.28) is called continuity from below , and (2.29) is called 
continuity from above. The designations “from below” and “from 
above” relate to the direction from which the respective sequences 
of probabilities approach their limit. These continuity results are 
the basis for using integral calculus to compute probabilities, since 
integrals can be expressed as limits of sums. 
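
Continuity from below and the telescoping argument can be checked numerically on a simple case. The sketch below assumes the uniform ("fair wheel") measure on $[0, 1)$, so that the probability of an interval is its length, and takes the increasing events $F_n = [0, 1 - 1/(n+1))$; both choices, and the function names, are illustrative assumptions rather than part of the text.

```python
from fractions import Fraction

def prob_F(n):
    """P(F_n) for F_n = [0, 1 - 1/(n+1)) under the uniform measure on [0, 1)."""
    return 1 - Fraction(1, n + 1)

def prob_G(n):
    """P(G_n), where G_0 = F_0 and G_n = F_n - F_{n-1} are disjoint pieces."""
    return prob_F(0) if n == 0 else prob_F(n) - prob_F(n - 1)

N = 1000
partial_sum = sum(prob_G(k) for k in range(N + 1))   # telescopes to P(F_N)
gap_to_limit = 1 - prob_F(N)                         # distance from P([0, 1)) = 1
```

With exact rational arithmetic the partial sum equals `prob_F(N)` exactly, and `gap_to_limit` is `1/(N + 1)`, which shrinks to zero as `N` grows, consistent with the probabilities of the $F_n$ increasing to the probability of their union.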



2.4 Discrete Probability Spaces 

We now provide several examples of probability measures on our ex- 
amples of sample spaces and sigma-fields and thereby give complete 
examples of probability spaces. 

The first example formalizes the description of a probability mea- 
sure as a sum of a pmf as introduced in the introductory section. 

[2.12] Let $\Omega$ be a finite set and let $\mathcal{F}$ be the power set of $\Omega$. Suppose
that we have a function $p(\omega)$ that assigns a real number to each
sample point $\omega$ in such a way that

$$p(\omega) \ge 0, \quad \text{all } \omega \in \Omega \qquad (2.30)$$

$$\sum_{\omega \in \Omega} p(\omega) = 1. \qquad (2.31)$$

Define the set function $P$ by

$$P(F) = \sum_{\omega \in F} p(\omega) = \sum_{\omega \in \Omega} 1_F(\omega) p(\omega), \quad \text{all } F \in \mathcal{F} \qquad (2.32)$$

where $1_F(\omega)$ is the indicator function of the set $F$, 1 if $\omega \in F$ and
0 otherwise.

For simplicity we drop the $\omega \in \Omega$ underneath the sum; that is,
when no range of summation is explicit, it should be assumed the
sum is over all possible values. Thus we can abbreviate (2.32) to

$$P(F) = \sum 1_F(\omega) p(\omega), \quad \text{all } F \in \mathcal{F}. \qquad (2.33)$$

P is easily verified to be a probability measure: It obviously sat- 
isfies axioms 2.1 and 2.2. It is finitely and countably additive from 
the properties of sums. In particular, given a sequence of disjoint 
events, only a finite number can be distinct (since the power set of 
a finite space has only a finite number of members). To be disjoint, 
the balance of the sequence must equal $\emptyset$. The probability of the
union of these sets will be the finite sum of the $p(\omega)$ over the points
in the union. This equals the sum of the probabilities of the sets in
the sequence. Example [2.1] is a special case of example [2.12], as is
the coin flip example of the introductory section.
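
The verification that a pmf induces a probability measure can be carried out exhaustively on a small finite space. The following sketch uses an arbitrary three-point pmf chosen only for illustration; the names `power_set` and `measure` are hypothetical helpers.

```python
from fractions import Fraction
from itertools import chain, combinations

def power_set(omega):
    """All subsets of a finite sample space, as frozensets."""
    items = sorted(omega)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))
    return [frozenset(s) for s in subsets]

def measure(pmf, event):
    """P(F): the sum of p(w) over the points w in the event F."""
    return sum((pmf[w] for w in event), Fraction(0))

# Illustrative pmf (an assumption): nonnegative values summing to one.
pmf = {0: Fraction(1, 2), 1: Fraction(1, 3), 2: Fraction(1, 6)}
events = power_set(pmf)

nonnegative = all(measure(pmf, F) >= 0 for F in events)
normalized = measure(pmf, frozenset(pmf)) == 1
additive = all(
    measure(pmf, F | G) == measure(pmf, F) + measure(pmf, G)
    for F in events for G in events if not (F & G))   # disjoint pairs only
```

All three flags come out true for this pmf, and for instance `measure(pmf, {0, 2})` is `2/3`.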

The summation (2.33) used to define probability measures for a 
discrete space is a special case of a more general weighted sum, which 
we pause to define and consider. Suppose that g is a real-valued 
function defined on $\Omega$, i.e., $g : \Omega \to \Re$ assigns a real number $g(\omega)$
to every $\omega \in \Omega$. We could consider more general complex-valued
functions, but for the moment it is simpler to stick to real-valued
functions. Also, we could consider mappings of $\Omega$ into subsets of
$\Re$, but it is convenient for the moment to let the range of $g$ be the
entire real line. Recall that in the introductory section we considered
such a function to be an example of signal processing and called it
a random variable. Given a pmf $p$, define the expectation of $g$ (with
respect to $p$) as

$$E(g) = \sum g(\omega) p(\omega). \qquad (2.34)$$

With this definition (2.33) with $g(\omega) = 1_F(\omega)$ yields

$$P(F) = E(1_F), \qquad (2.35)$$

showing that the probability of an event is the expectation of the 
indicator function of the event. Mathematically, we can think of 
expectation as a generalization of the idea of probability since prob- 
ability is the special case of expectation that results when the only 
functions allowed are indicator functions. 

Expectations are also called probabilistic averages or statistical av- 
erages. For the time being, probabilities are the most important 
examples of expectation. We shall see many examples, however, so 
it is worthwhile to mention a few of the most important. Suppose 
that the sample space is a subset of the real line, e.g., $\mathcal{Z}$ or $\mathcal{Z}_n$. One
of the most commonly encountered expectations is the mean or first
moment

$$m = \sum \omega p(\omega), \qquad (2.36)$$

where $g(\omega) = \omega$, the identity function. A more general idea is the
$k$th moment defined by

$$m^{(k)} = \sum \omega^k p(\omega), \qquad (2.37)$$

so that $m = m^{(1)}$. After the mean, the most commonly encountered
moment in practice is the second moment,

$$m^{(2)} = \sum \omega^2 p(\omega). \qquad (2.38)$$

Moments can be thought of as parameters describing a pmf, and some 
computations involving signal processing will turn out to depend only 
on certain moments. 

2 This is not in fact the fundamental definition of expectation that will be 
introduced in chapter 4, but it will be seen to be equivalent 
A slight variation on $k$th-order moments is given by the so-called centralized
moments, formed by subtracting the mean before taking the power:

$$\sum (\omega - m)^k p(\omega), \qquad (2.39)$$

but the only such moment commonly encountered in practice is the
variance

$$\sigma^2 = \sum (\omega - m)^2 p(\omega). \qquad (2.40)$$

The variance and the second moment are easily related as

$$\begin{aligned}
\sigma^2 &= \sum (\omega - m)^2 p(\omega) \\
&= \sum (\omega^2 - 2\omega m + m^2) p(\omega) \\
&= \sum \omega^2 p(\omega) - 2m \sum \omega p(\omega) + m^2 \sum p(\omega) \\
&= m^{(2)} - 2m^2 + m^2 = m^{(2)} - m^2. \qquad (2.41)
\end{aligned}$$
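
The expectation (2.34) and the moment relations above are easy to check with exact arithmetic. This is a sketch on an arbitrary three-point pmf chosen for illustration; the function name `expectation` is an assumption, not notation from the text.

```python
from fractions import Fraction

def expectation(g, pmf):
    """E(g): the sum of g(w) p(w) over all sample points w."""
    return sum(g(w) * p for w, p in pmf.items())

# Illustrative pmf on {0, 1, 2} (an assumption made for this example).
pmf = {0: Fraction(1, 2), 1: Fraction(1, 3), 2: Fraction(1, 6)}

mean = expectation(lambda w: w, pmf)                      # first moment
second_moment = expectation(lambda w: w ** 2, pmf)        # second moment
variance = expectation(lambda w: (w - mean) ** 2, pmf)    # centralized, k = 2

# Probability of F = {1, 2} as the expectation of its indicator function.
prob_F = expectation(lambda w: 1 if w in {1, 2} else 0, pmf)
```

For this pmf the mean is `2/3`, the second moment is `1`, and the variance is `5/9`, which agrees with the second moment minus the squared mean; `prob_F` is `1/2`.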



Probability Mass Functions 

A function $p(\omega)$ satisfying (2.30) and (2.31) is called a probability
mass function or pmf. It is important to observe that the probability 
mass function is defined only for points in the sample space, while 
a probability measure is defined for events, sets which belong to an 
event space. Intuitively, the probability of a set is given by the sum 
of the probabilities of the points as given by the pmf. Obviously it is 
much easier to describe the probability function than the probability 
measure since it need only be specified for points. The axioms of 
probability then guarantee that the probability function can be used 
to compute the probability measure. Note that given one probability 
form, we can always determine the other. In particular, given the 
pmf p, we can construct P using (2.32). Given P, we can find the 
corresponding pmf p from the formula 

$$p(\omega) = P(\{\omega\}).$$

We list below several of the most common examples of pmf’s. 
The reader should verify that they are all indeed valid pmf’s, that 
is, that they satisfy (2.30) and (2.31). 
The binary pmf. $\Omega = \{0, 1\}$; $p(0) = 1 - p$, $p(1) = p$, where $p$ is a
parameter in $(0, 1)$.

A uniform pmf. $\Omega = \mathcal{Z}_n = \{0, 1, \ldots, n - 1\}$ and $p(k) = 1/n$; $k \in \mathcal{Z}_n$.

The binomial pmf. $\Omega = \mathcal{Z}_{n+1} = \{0, 1, \ldots, n\}$ and

$$p(k) = \binom{n}{k} p^k (1 - p)^{n-k}; \quad k \in \mathcal{Z}_{n+1},$$

where

$$\binom{n}{k} = \frac{n!}{k!\,(n - k)!}$$

is the binomial coefficient (read as “n choose k”).

The binary pmf is a probability model for coin flipping with a 
biased coin or for a single sample of a binary data stream. A uniform 
pmf on $\mathcal{Z}_6$ can model the roll of a fair die. Observe that it would
not be a good model for ASCII data since, for example, the letters t 
and e and the symbol for space have a higher probability than other 
letters. The binomial pmf is a probability model for the number of 
heads in n successive independent flips of a biased coin, as will later 
be seen. 
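
The validity of these three finite pmf's can be confirmed with exact rational arithmetic. The sketch below is illustrative only: the function names are hypothetical and the parameter values are arbitrary choices.

```python
from fractions import Fraction
from math import comb

def binary_pmf(p):
    return {0: 1 - p, 1: p}

def uniform_pmf(n):
    return {k: Fraction(1, n) for k in range(n)}

def binomial_pmf(n, p):
    return {k: comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)}

def is_valid_pmf(pmf):
    """Check nonnegativity (2.30) and unit sum (2.31)."""
    return all(v >= 0 for v in pmf.values()) and sum(pmf.values()) == 1

p = Fraction(1, 3)   # arbitrary parameter in (0, 1)
checks = [is_valid_pmf(binary_pmf(p)),
          is_valid_pmf(uniform_pmf(6)),
          is_valid_pmf(binomial_pmf(10, p))]
```

Because `Fraction` arithmetic is exact, the unit-sum test is an equality rather than a floating-point approximation; all three entries of `checks` are true.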

The same construction provides a probability measure on countably
infinite spaces such as $\mathcal{Z}$ and $\mathcal{Z}_+$. It is no longer as simple to
prove countable additivity, but it should be fairly obvious that it 
holds and, at any rate, it follows from standard results in elementary 
analysis for convergent series. Hence we shall only state the following 
example without proving countable additivity, but bear in mind that 
it follows from the properties of infinite summations. 

[2.13] Let $\Omega$ be a space with a countably infinite number of elements
and let $\mathcal{F}$ be the power set of $\Omega$. Then if $p(\omega)$; $\omega \in \Omega$
satisfies (2.30) and (2.31), the set function $P$ defined by (2.32) is
a probability measure.

Two common examples of pmf’s on countably infinite sample 
spaces follow. The reader should test their validity.

The geometric pmf. $\Omega = \{1, 2, 3, \ldots\}$ and $p(k) = (1 - p)^{k-1} p$; $k = 1, 2, \ldots$,
where $p \in (0, 1)$ is a parameter.

The Poisson pmf. $\Omega = \mathcal{Z}_+ = \{0, 1, 2, \ldots\}$ and $p(k) = (\lambda^k e^{-\lambda})/k!$,
where $\lambda$ is a parameter in $(0, \infty)$. (Keep in mind that $0! = 1$.)

We will later see the origins of several of these pmf’s and their 
applications. For example, both the binomial and the geometric 
pmf will be derived from the simple binary pmf model for flipping a 
single coin. For the moment they should be considered as common 
important examples. Various properties of these pmf’s and a variety 
of calculations involving them are explored in the problems at the 
end of the chapter. 
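
A numerical check is no substitute for the proofs asked of the reader, but partial sums give a quick sanity test that the geometric and Poisson pmf's sum to one. The sketch below truncates the infinite sums; the function names, parameter values, and truncation lengths are all arbitrary assumptions.

```python
import math

def geometric_partial_sum(p, terms):
    """Sum of p (1 - p)^(k-1) for k = 1, ..., terms."""
    return sum(p * (1 - p) ** (k - 1) for k in range(1, terms + 1))

def poisson_partial_sum(lam, terms):
    """Sum of lam^k e^(-lam) / k! for k = 0, ..., terms - 1."""
    total = 0.0
    term = math.exp(-lam)          # the k = 0 term
    for k in range(terms):
        total += term
        term *= lam / (k + 1)      # ratio of consecutive Poisson terms
    return total

geometric_total = geometric_partial_sum(0.3, 200)
poisson_total = poisson_partial_sum(2.5, 60)
```

Both totals agree with 1 to within floating-point precision, while a short truncation such as `geometric_partial_sum(0.3, 5)` falls visibly short of 1, as expected for a partial sum of positive terms.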



Computational Examples 

The various named pmf’s provide examples for computing probabil- 
ities and other expectations. Although much of this is prerequisite 
material, it does not hurt to collect several of the more useful tricks 
that arise in evaluating sums. The binary pmf is too simple to provide
much interest alone, so first consider the uniform pmf on $\mathcal{Z}_n$. This
is trivially a valid pmf since it is nonnegative and sums to 1. The
probability of any set is simply

$$P(F) = \frac{1}{n} \sum 1_F(\omega) = \frac{\#(F)}{n},$$

where #(F) denotes the number of elements or points in the set F. 
The mean is given by 



$$m = \frac{1}{n} \sum_{k=1}^{n} k = \frac{n + 1}{2}, \qquad (2.42)$$

a standard formula easily verified by induction, as detailed in appendix B.
The second moment is given by

$$m^{(2)} = \frac{1}{n} \sum_{k=1}^{n} k^2 = \frac{(n + 1)(2n + 1)}{6} \qquad (2.43)$$

as can also be verified by induction. The variance can be found by 
combining (2.43), (2.42), and (2.41). 
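
The closed forms in (2.42) and (2.43) can be spot-checked by direct summation. The sketch below assumes the sample points are $1, 2, \ldots, n$, matching the summation range in those two equations; the function name `uniform_moments` is hypothetical.

```python
from fractions import Fraction

def uniform_moments(n):
    """Mean, second moment, and variance of the uniform pmf on {1, ..., n},
    computed by direct summation with exact rational arithmetic."""
    weight = Fraction(1, n)
    mean = sum(k * weight for k in range(1, n + 1))
    second = sum(k * k * weight for k in range(1, n + 1))
    return mean, second, second - mean ** 2   # variance via (2.41)

mean, second, variance = uniform_moments(6)   # a fair die
```

For `n = 6` this gives mean `7/2`, second moment `91/6`, and variance `35/12`, in agreement with $(n + 1)/2$ and $(n + 1)(2n + 1)/6$.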

The binomial pmf is more complicated. The first issue is to prove 
that it sums to one and hence is a valid pmf (it is obviously nonneg- 
ative). This is accomplished by recalling the binomial theorem from 
high school algebra: 

$$(a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k} \qquad (2.44)$$

and setting $a = p$ and $b = 1 - p$ to write

$$\sum_{k=0}^{n} p(k) = \sum_{k=0}^{n} \binom{n}{k} p^k (1 - p)^{n-k} = (p + 1 - p)^n = 1.$$



Finding moments is trickier here, and we shall later develop a much 
easier way to do this using exponential transforms. Nonetheless, 
it provides useful practice to compute an example sum, if only to 
demonstrate later how much work can be avoided! Finding the mean 
requires evaluation of the sum 



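$$\begin{aligned}
m &= \sum_{k=0}^{n} k \binom{n}{k} p^k (1 - p)^{n-k} \\
&= \sum_{k=0}^{n} k \frac{n!}{(n - k)!\, k!} p^k (1 - p)^{n-k} \\
&= \sum_{k=1}^{n} \frac{n!}{(n - k)!\, (k - 1)!} p^k (1 - p)^{n-k}.
\end{aligned}$$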

The trick here is to recognize that the sum looks very much like the 
terms in the binomial theorem, but a change of variables is needed 
to get the binomial theorem to simplify things. Changing variables 
by defining $l = k - 1$, the sum becomes

$$m = \sum_{l=0}^{n-1} \frac{n!}{(n - l - 1)!\, l!} p^{l+1} (1 - p)^{n-l-1},$$



which will very much resemble the binomial theorem with $n - 1$ replacing
$n$ if we factor out a $p$ and an $n$:

$$m = np \sum_{l=0}^{n-1} \frac{(n - 1)!}{(n - 1 - l)!\, l!} p^l (1 - p)^{n-1-l} = np (p + 1 - p)^{n-1} = np. \qquad (2.45)$$



The second moment is messier, so its evaluation is postponed until 
simpler means are developed. 
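
The result $m = np$ can be confirmed by evaluating the defining sum directly. The sketch below uses exact rational arithmetic; the function name `binomial_mean` and the parameter values are arbitrary illustrative choices.

```python
from fractions import Fraction
from math import comb

def binomial_mean(n, p):
    """Mean of the binomial pmf, computed as the sum of k p(k), k = 0, ..., n."""
    return sum(k * comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n + 1))

n, p = 10, Fraction(3, 10)
mean = binomial_mean(n, p)
```

Here `mean` equals `n * p`, that is, exactly 3.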

The geometric pmf is handled using the geometric progression, 
usually treated in high school algebra and summarized in appendix B. 
From (B.4) in appendix B we have for any real $a$ with $|a| < 1$

$$\sum_{k=0}^{\infty} a^k = \frac{1}{1 - a}, \qquad (2.46)$$



which proves that the geometric pmf indeed sums to 1. 

Evaluation of the mean of the geometric pmf requires evaluation 
of the sum 



$$m = \sum_{k=1}^{\infty} k p(k) = \sum_{k=1}^{\infty} k p (1 - p)^{k-1}.$$

One may have access to a book of tables which includes this sum, 
but a useful trick can be used to evaluate the sum from the well- 
known result for summing a geometric series. The trick involves 
differentiating the usual geometric progression sum, as detailed in 
appendix B, where it is shown for any $q \in (0, 1)$ that

$$\sum_{k=0}^{\infty} k q^{k-1} = \frac{1}{(1 - q)^2}. \qquad (2.47)$$
Set $q = 1 - p$ in (2.47) and the formula for $m$ yields

$$m = \frac{1}{p}. \qquad (2.48)$$

A similar idea works for the second moment. From (B.7) of
appendix B the second moment is given by

$$
m^{(2)} = \sum_{k=1}^{\infty} k^2\,p\,(1-p)^{k-1}
        = p\left(\frac{2}{p^3} - \frac{1}{p^2}\right) \tag{2.49}
$$



and hence from (2.41) the variance is

$$
\sigma^2 = m^{(2)} - m^2 = \frac{2 - p}{p^2} - \frac{1}{p^2} = \frac{1 - p}{p^2}. \tag{2.50}
$$
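The closed forms for the geometric pmf can be checked against truncated sums. The sketch below assumes the pmf $p(k) = p(1-p)^{k-1}$ on $k = 1, 2, \ldots$; the helper names and the truncation length are illustrative assumptions, chosen so that the neglected tail is far below floating-point precision for moderate $p$.

```python
def geometric_pmf(k, p):
    """Geometric pmf p(k) = p (1 - p)^(k - 1) for k = 1, 2, ..."""
    return p * (1 - p) ** (k - 1)


def truncated_moment(order, p, terms=2000):
    """Approximate sum_{k>=1} k^order p(k) using only the first `terms` terms."""
    return sum(k**order * geometric_pmf(k, p) for k in range(1, terms + 1))


def geometric_summary(p):
    """Return (mean, second moment, variance) computed from truncated sums."""
    mean = truncated_moment(1, p)
    second = truncated_moment(2, p)
    return mean, second, second - mean**2


if __name__ == "__main__":
    p = 0.25  # illustrative value only
    print(geometric_summary(p))
    # Closed forms for comparison: 1/p, (2 - p)/p^2, (1 - p)/p^2.
    print(1 / p, (2 - p) / p**2, (1 - p) / p**2)
```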



As an example of a probability computation using a geometric pmf,
suppose that $(\Omega, \mathcal{F}, P)$ is a discrete probability space with
$\Omega = \mathcal{Z}_+$, $\mathcal{F}$ the power set of $\Omega$, and $P$ the probability
measure induced by the geometric pmf with parameter $p$. Find the
probabilities of the events $F = \{k : k \ge 10\}$ and
$G = \{k : k \text{ is odd}\}$. Alternatively note that
$F = \{10, 11, 12, \ldots\}$ and $G = \{1, 3, 5, 7, \ldots\}$ (we consider only odd
numbers in the sample space, that is, only positive odd numbers).
We have that

$$
\begin{aligned}
P(F) &= \sum_{k \in F} p(k) = \sum_{k=10}^{\infty} p\,(1-p)^{k-1} \\
     &= \frac{p}{1-p} \sum_{k=10}^{\infty} (1-p)^{k} \\
     &= \frac{p}{1-p}\,(1-p)^{10} \sum_{k=10}^{\infty} (1-p)^{k-10} \\
     &= p\,(1-p)^{9} \sum_{k=0}^{\infty} (1-p)^{k} = (1-p)^{9},
\end{aligned}
$$



where the suitable form of the geometric progression has been
derived from the basic form (B.4). While we have concentrated on the
calculus, this problem could be interpreted as the solution to a word
problem. For example, suppose you arrive at the Post Office and
you know that the number of people in line is described by a
geometric pmf with $p = 1/2$. What is the probability that there
are at least ten people in line? From the solution just obtained the
answer is $(1 - 0.5)^9 = 2^{-9}$.

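To find the probability of an odd outcome, we proceed in the same
general fashion to write

$$
\begin{aligned}
P(G) &= \sum_{k \in G} p(k) = \sum_{k=1,3,\ldots} p\,(1-p)^{k-1} \\
     &= p \sum_{k=0,2,4,\ldots} (1-p)^{k} = p \sum_{k=0}^{\infty} \left[(1-p)^2\right]^{k} \\
     &= \frac{p}{1 - (1-p)^2} = \frac{1}{2 - p}.
\end{aligned}
$$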

Thus in the word-problem example of the post office line, the probability
of finding an odd number of people in line is $1/(2 - 1/2) = 2/3$.
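Both post office answers can be reproduced by summing the geometric pmf over the events directly. This is a minimal sketch assuming the pmf $p(k) = p(1-p)^{k-1}$, $k = 1, 2, \ldots$; the function names and the number of terms kept are illustrative assumptions, not part of the text.

```python
def geometric_pmf(k, p):
    """Geometric pmf p(k) = p (1 - p)^(k - 1) for k = 1, 2, ..."""
    return p * (1 - p) ** (k - 1)


def prob_at_least(k_min, p, terms=5000):
    """Approximate P({k : k >= k_min}) by a truncated sum over the event."""
    return sum(geometric_pmf(k, p) for k in range(k_min, k_min + terms))


def prob_odd(p, terms=5000):
    """Approximate P({k : k odd}) by summing over k = 1, 3, 5, ..."""
    return sum(geometric_pmf(k, p) for k in range(1, 2 * terms, 2))


if __name__ == "__main__":
    p = 0.5
    # Compare with the closed forms (1 - p)^9 and 1 / (2 - p).
    print(prob_at_least(10, p), (1 - p) ** 9)
    print(prob_odd(p), 1 / (2 - p))
```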

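Lastly we consider the Poisson pmf, again beginning with a
verification that it is indeed a pmf. Consider the sum

$$
\sum_{k=0}^{\infty} p(k) = \sum_{k=0}^{\infty} \frac{\lambda^k e^{-\lambda}}{k!}
  = e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k}{k!}.
$$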



Here the trick is to recognize the sum as the Taylor series expansion
for an exponential, that is,

$$
e^{\lambda} = \sum_{k=0}^{\infty} \frac{\lambda^k}{k!},
$$



whence

$$
\sum_{k=0}^{\infty} p(k) = e^{-\lambda} e^{\lambda} = 1,
$$



proving the claim. 

To evaluate the mean of the Poisson pmf, begin with

$$
\sum_{k=0}^{\infty} k\,p(k) = \sum_{k=1}^{\infty} k\,\frac{\lambda^k e^{-\lambda}}{k!}
  = e^{-\lambda} \sum_{k=1}^{\infty} \frac{\lambda^k}{(k-1)!}.
$$

Change variables $l = k - 1$ and pull a $\lambda$ out of the sum to write

$$
\sum_{k=0}^{\infty} k\,p(k) = \lambda e^{-\lambda} \sum_{l=0}^{\infty} \frac{\lambda^l}{l!}.
$$

Recognizing the sum as $e^{\lambda}$, this yields

$$
m = \lambda. \tag{2.51}
$$



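The second moment is found similarly, but with more bookkeeping.
Analogous to the mean computation, and using $k^2 = k(k-1) + k$,

$$
\begin{aligned}
m^{(2)} &= \sum_{k=1}^{\infty} k^2\,\frac{\lambda^k e^{-\lambda}}{k!}
         = \sum_{k=2}^{\infty} k(k-1)\,\frac{\lambda^k e^{-\lambda}}{k!} + m \\
        &= \sum_{k=2}^{\infty} \frac{\lambda^k e^{-\lambda}}{(k-2)!} + m.
\end{aligned}
$$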



Change variables $l = k - 2$ and pull $\lambda^2$ out of the sum to obtain

$$
m^{(2)} = \lambda^2 \sum_{l=0}^{\infty} \frac{\lambda^l e^{-\lambda}}{l!} + m = \lambda^2 + \lambda \tag{2.52}
$$



so that from (2.41) the variance is

$$
\sigma^2 = \lambda. \tag{2.53}
$$
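The Poisson results (2.51) to (2.53) are easy to confirm numerically. The sketch below builds the pmf with the recursion $p(k) = p(k-1)\lambda/k$ rather than evaluating $k!$ directly; the function names, the truncation length, and the sample value of $\lambda$ are assumptions made for illustration.

```python
from math import exp


def poisson_pmf_table(lam, terms):
    """Return [p(0), ..., p(terms - 1)] for the Poisson pmf with parameter lam."""
    probs = [exp(-lam)]
    for k in range(1, terms):
        # p(k) = p(k - 1) * lam / k avoids computing large factorials.
        probs.append(probs[-1] * lam / k)
    return probs


def poisson_moments(lam, terms=200):
    """Return (total mass, mean, second moment, variance) from truncated sums."""
    probs = poisson_pmf_table(lam, terms)
    total = sum(probs)
    mean = sum(k * pk for k, pk in enumerate(probs))
    second = sum(k * k * pk for k, pk in enumerate(probs))
    return total, mean, second, second - mean**2


if __name__ == "__main__":
    lam = 3.5  # illustrative value only
    # Expect values close to 1, lam, lam^2 + lam, and lam respectively.
    print(poisson_moments(lam))
```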



Multidimensional pmf’s 

While the foregoing ideas were developed for scalar sample spaces,
they also apply to vector sample spaces. For example, if $A$ is a
discrete space, then so is the vector space
$A^k = \{\text{all vectors } \mathbf{x} = (x_0, \ldots, x_{k-1}) \text{ with } x_i \in A,\ i = 0, 1, \ldots, k-1\}$.
A common example of a pmf on vectors is the product pmf of the
following example.

[2.15] The product pmf. 

Let $\{p_i;\ i = 0, 1, \ldots, k-1\}$ be a collection of one-dimensional
pmf’s; that is, for each $i = 0, 1, \ldots, k-1$, $p_i(r)$; $r \in A$ satisfies
(2.30) and (2.31). Define the product $k$-dimensional pmf $p$ on
$A^k$ by

$$
p(\mathbf{x}) = p(x_0, x_1, \ldots, x_{k-1}) = \prod_{i=0}^{k-1} p_i(x_i).
$$

As a more specific example, suppose that all of the marginal pmf’s
are the same and are given by a Bernoulli pmf:

$$
p(x) = p^{x} (1-p)^{1-x}; \quad x = 0, 1.
$$

Then the corresponding product pmf for a $k$-dimensional vector
becomes

$$
p(x_0, x_1, \ldots, x_{k-1}) = \prod_{i=0}^{k-1} p^{x_i} (1-p)^{1-x_i}
  = p^{w(x_0, x_1, \ldots, x_{k-1})} (1-p)^{k - w(x_0, x_1, \ldots, x_{k-1})},
$$

where $w(x_0, x_1, \ldots, x_{k-1})$ is the number of ones occurring in the
binary $k$-tuple $x_0, x_1, \ldots, x_{k-1}$, the Hamming weight of the vector.
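To make the Hamming-weight form concrete, the product of Bernoulli marginals can be compared with $p^{w}(1-p)^{k-w}$ for every binary $k$-tuple. This is a small sketch; the function names and the sample values of $k$ and $p$ are illustrative assumptions.

```python
from itertools import product


def bernoulli_pmf(x, p):
    """Bernoulli pmf p^x (1 - p)^(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    return p if x == 1 else 1 - p


def product_pmf(xs, p):
    """Product pmf: multiply the identical Bernoulli marginals coordinate by coordinate."""
    result = 1.0
    for x in xs:
        result *= bernoulli_pmf(x, p)
    return result


def hamming_weight(xs):
    """Number of ones in the binary tuple xs."""
    return sum(xs)


def hamming_form(xs, p):
    """Closed form p^w (1 - p)^(k - w), with w the Hamming weight of xs."""
    w = hamming_weight(xs)
    return p**w * (1 - p) ** (len(xs) - w)


if __name__ == "__main__":
    k, p = 4, 0.3  # illustrative values only
    for xs in product((0, 1), repeat=k):
        assert abs(product_pmf(xs, p) - hamming_form(xs, p)) < 1e-12
    # The product pmf sums to 1 over all 2^k binary k-tuples.
    print(sum(product_pmf(xs, p) for xs in product((0, 1), repeat=k)))
```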



2.5 Continuous Probability Spaces 

Continuous spaces are handled in a manner analogous to discrete 
spaces, but with some fundamental differences. The primary dif- 
ference is that usually probabilities are computed by integrating a 
density function instead of summing a mass function. The good 
news is that most formulas look the same with integrals replacing 
sums. The bad news is that there are some underlying theoretical 
issues that require consideration. The problem is that integrals are 
themselves limits, and limits do not always exist in the sense of con- 
verging to a finite number. Because of this, some care will be needed 
to clarify when the resulting probabilities are well defined. 

[2.14] Let $(\Omega, \mathcal{F}) = (\Re, \mathcal{B}(\Re))$, the real line together with its Borel
field. Suppose that we have a real-valued function $f$ on the real
line that satisfies the following properties:

$$
f(r) \ge 0, \quad \text{all } r \in \Omega, \tag{2.54}
$$

$$
\int_{\Omega} f(r)\,dr = 1, \tag{2.55}
$$

that is, the function $f(r)$ has a well-defined integral over the real
line. Define the set function $P$ by

$$
P(F) = \int_{F} f(r)\,dr = \int 1_F(r) f(r)\,dr, \quad F \in \mathcal{B}(\Re). \tag{2.56}
$$

We note that a probability space defined as a probability measure 
on a Borel field is an example of a Borel space. 

Again as in the discrete case, this integral is a special case of
a more general weighted integral: Suppose that $g$ is a real-valued
function defined on $\Omega$, i.e., $g : \Omega \to \Re$ assigns a real number $g(r)$ to
every $r \in \Omega$. Recall that such a function is called a random variable.
Given a pdf $f$, define the expectation of $g$ (with respect to $f$) as

$$
E(g) = \int g(r) f(r)\,dr. \tag{2.57}
$$

With this definition we can rewrite (2.56) as

$$
P(F) = E(1_F), \tag{2.58}
$$



which has exactly the same form as in the discrete case. Thus
probabilities can be considered as expectations of indicator functions
both in the discrete case, where the probability measure is described by
a pmf, and in the continuous case, where the probability measure is
described by a pdf.

As in the discrete case, there are several particularly important
examples of expectations if the sample space is a subset of the real
line, e.g., $\Re$ or $[0, 1)$. The definitions are exact integral analogs of
those for the discrete cases: the mean or first moment

$$
m = \int r f(r)\,dr, \tag{2.59}
$$

the $k$th moment

$$
m^{(k)} = \int r^k f(r)\,dr, \tag{2.60}
$$

including the second moment

$$
m^{(2)} = \int r^2 f(r)\,dr, \tag{2.61}
$$

the centralized moments formed by subtracting the mean before
taking the power:

$$
\int (r - m)^k f(r)\,dr, \tag{2.62}
$$

including the variance

$$
\sigma^2 = \int (r - m)^2 f(r)\,dr. \tag{2.63}
$$

When more general complex-valued random variables are considered,
often the $k$th absolute moment is used instead:

$$
m_a^{(k)} = \int |r|^k f(r)\,dr. \tag{2.64}
$$

As in the discrete case, the variance and the second moment are
easily related as

$$
\sigma^2 = m^{(2)} - m^2. \tag{2.65}
$$



An important technical detail not yet considered is whether or not 
the set function defined as an integral over a pdf is actually a prob- 
ability measure. In particular, are the probabilities of all events well 
defined and do they satisfy the axioms of probability? Intuitively 
this should be the case since (2.54) to (2.56) are the integral analogs
of the summations of (2.30) to (2.32) and we have argued that
summing pmf’s provides a well-defined probability measure. In fact, this
is a mathematically delicate issue which leads to the reasons behind
the requirements for sigma-fields and Borel fields. Before exploring
these issues in more depth in the next section, the easy portion of 
the answer should be recalled: We have already argued in the intro- 
duction to this chapter that if we define a set function P(F) as the 
integral of a pdf over the set F, then if the integral exists for the sets 
in question, the set function must be nonnegative, normalized, and 
additive, that is, it must satisfy the first three axioms of probability. 
This is well and good, but it leaves some key points unanswered.
First, is the candidate probability measure defined for all Borel sets?
Equivalently, are we guaranteed that the integral will make sense 
for all sets (events) of interest? Second, is the candidate probabil- 
ity measure also countably additive or, equivalently, continuous from 
above or below? The answer to both questions is unfortunately no 
if one considers the integral to be a Riemann integral, the integral 
most engineers learn as undergraduates. The integral is not certain 
to exist for all Borel sets, even if the pdf is a simple uniform pdf. Rie- 
mann integrals in general do not have nice limiting properties, so the 
necessary continuity properties do not hold in general for Riemann 
integrals. These delicate issues are considered next in an optional 
subsection and further in appendix B. The bottom line can be easily 
summarized as follows. 

• Eq. (2.56) defines a probability measure on the Borel space of the real line 
and its Borel sets provided that the integral is interpreted as a Lebesgue 
integral. In all practical cases of interest, the Lebesgue integral is either 
equal to the Riemann integral, usually more familiar to engineers, or to 
a limit of Riemann integrals on a converging sequence of sets. 



★ Probabilities as Integrals 

The first issue is fundamental: Does the integral of (2.56) make sense; 
i.e., is it well-defined for all events of interest? Suppose first that we
take the common engineering approach and use Riemann integration,
the form of integration used in elementary calculus. Then the
above integrals are defined at least for events $F$ that are intervals.
This implies from the linearity properties of Riemann integration
that the integrals are also well-defined for events $F$ that are finite
unions of intervals. It is not difficult, however, to construct sets $F$
for which the indicator function $1_F$ is so nasty that the function
$f(r) 1_F(r)$ does not have a Riemann integral. For example, suppose
that $f(r)$ is 1 for $r \in [0, 1]$ and 0 otherwise. Then the Riemann
integral $\int 1_F(r) f(r)\,dr$ is not defined for the set $F$ of all irrational
numbers, yet intuition should suggest that the set has probability 1.
This intuition reflects the fact that if all points are somehow equally
probable, then since the unit interval contains an uncountable
infinity of irrational numbers and only a countable infinity of rational
numbers, then the probability of the former set should be one and 
that of the latter 0. This intuition is not reflected in the integral def- 
inition, which is not defined for either set by the Riemann approach. 
Thus the definition of (2.56) has a basic problem: The integral in 
the formula giving the probability measure of a set might not be 
well-defined. 

A natural approach to escaping this dilemma would be to use the 
Riemann integral when possible, i.e., to define the probabilities of 
events that are finite unions of intervals, and then to obtain the 
probabilities of more complicated events by expressing them as a 
limit of finite unions of intervals, if the limit makes sense. This would 
hopefully give us a reasonable definition of a probability measure on 
a class of events much larger than the class of all finite unions of 
intervals. Intuitively, it should give us a probability measure of all 
sets that can be expressed as increasing or decreasing limits of finite 
unions of intervals. 

This larger class is, in fact, the Borel field, but the Riemann in- 
tegral has the unfortunate property that in general we cannot in- 
terchange limits and integration; that is, the limit of a sequence of 
integrals of converging functions may not be itself an integral of a 
limiting function. 

This problem is so important to the development of a rigorous 
probability theory that it merits additional emphasis: even though 
the familiar Riemann integrals of elementary calculus suffice for most 
engineering and computational purposes, they are too weak for build- 
ing a useful theory, proving theorems, and evaluating the probabil- 
ities of some events which can be most easily expressed as limits of 
simple events. The problem is that the Riemann integral does not 
exist for sufficiently general functions and that limits and integration 
cannot be interchanged in general. 

The solution is to use a different definition of integration — the 
Lebesgue integral. Here we need only concern ourselves with a few 
simple properties of the Lebesgue integral, which are summarized 
below. The interested reader is referred to appendix B for a brief
summary of basic definitions and properties of the Lebesgue integral
which reinforce the following remarks.

The Riemann integral of a function $f(r)$ “carves up” or partitions
the domain of the argument $r$ and effectively considers weighted sums
of the values of the function $f(r)$ as the partition becomes ever finer.
In contrast, the Lebesgue integral “carves up” the values of the
function itself and effectively defines an integral as a limit of simple
integrals of quantized versions of the function. This simple change
of definition results in two fundamentally important properties of
Lebesgue integrals that are not possessed by Riemann integrals:



1. The integral is defined for all Borel sets.

2. Subject to suitable technical conditions (such as requiring that the
integrands have bounded absolute value), one can interchange the order of
limits and integration; e.g., if $F_n \uparrow F$, then

$$
\begin{aligned}
P(F) &= \int 1_F(r) f(r)\,dr = \int \lim_{n \to \infty} 1_{F_n}(r) f(r)\,dr \\
     &= \lim_{n \to \infty} \int 1_{F_n}(r) f(r)\,dr = \lim_{n \to \infty} P(F_n),
\end{aligned}
$$

that is, (2.28) holds, and hence the set function is continuous from below.



We have already seen that if the integral exists, then (2.56) ensures 
that the first three axioms hold. Thus the existence of the Lebesgue 
integral on all Borel sets coupled with continuity and the first three 
axioms ensures that a set function defined in this way is indeed a 
probability measure. We observe in passing that even if we confined 
interest to events for which the Riemann integral made sense, it 
would not follow that the resulting probability measure would be 
countably additive: As with continuity, these asymptotic properties 
hold for Lebesgue integration but not for Riemann integration. 

How do we reconcile the use of a Lebesgue integral given the as- 
sumed prerequisite of traditional engineering calculus courses based 
on the Riemann integral? Here a standard result of real analysis 
comes to our aid: If the ordinary Riemann integral exists over a fi- 
nite interval, then so does the Lebesgue integral, and the two are the 
same. If the Riemann integral does not exist, then we can try to find
the probability as a limit of probabilities of simple events for which
the Riemann integrals do exist, e.g., as the limit of probabilities of
finite unions of intervals. It is possible for the improper Riemann
integral to exist, i.e., a Riemann integral over infinite limits, and
yet have the Lebesgue integral not be well defined. For example,
$\int_0^{\infty} \frac{\sin x}{x}\,dx = \frac{\pi}{2}$ exists when evaluated as an improper Riemann
integral, but it does not exist as a Lebesgue integral. Riemann calculus
will usually suffice for computation (at least if $f(r)$ is Riemann
integrable over a finite interval) provided we realize that we may have
to take limits of Riemann integrals for complicated events. Observe,
for example, that in the case mentioned where $f(r)$ is 1 on $[0, 1]$, the
probability of a single point 1/2 can now be found easily as a limit
of Riemann integrals:

$$
P\left(\left\{\tfrac{1}{2}\right\}\right)
  = \lim_{\epsilon \to 0} \int_{(1/2 - \epsilon,\ 1/2 + \epsilon)} dr
  = \lim_{\epsilon \to 0} 2\epsilon = 0,
$$

as expected.

In summary, our engineering compromise is this: We must real- 
ize that for the theory to be valid and for (2.56) indeed to give a 
probability measure on subsets of the real line, the integral must be 
interpreted as a Lebesgue integral and Riemann integrals may not 
exist. For computation, however, one will almost always be able to 
find probabilities by either Riemann integration or by taking limits 
of Riemann integrals over simple events. This distinction between 
Riemann integrals for computation and Lebesgue integrals for the- 
ory is analogous to the distinction between rational numbers and real 
numbers. Computational and engineering tasks use only arithmetic 
of finite precision in practice. However, in developing the theory
irrational numbers such as $\sqrt{2}$ and $\pi$ are essential. Imagine how hard
it would be to develop a theory without using irrational numbers,
and how unwise it would be to do so just because the eventual com- 
putations do not use them. So it is with Lebesgue integrals. 



Probability Density Functions 

The function $f$ used in (2.54) to (2.56) is called a probability density
function or pdf since it is a nonnegative function that is integrated
to find a total mass of probability, just as a mass density function
in physics is integrated to find a total mass. Like a pmf, a pdf is
defined only for points in $\Omega$ and not for sets. Unlike a pmf, a pdf is
not in itself the probability of anything; for example, a pdf can take
on values greater than one, while a pmf cannot. Under a pdf, points
frequently have probability zero, even though the pdf is nonzero. We
can, however, interpret a pdf as being proportional to a probability in
the following sense. For a pmf we had $p(x) = P(\{x\})$. Suppose now
that the sample space is the real line and that a pdf $f$ is defined.
Let $F = [x, x + \Delta x)$, where $\Delta x$ is extremely small. Then if $f$ is
sufficiently smooth, the mean value theorem of calculus implies that

$$
P([x, x + \Delta x)) = \int_{x}^{x + \Delta x} f(\alpha)\,d\alpha \approx f(x)\,\Delta x. \tag{2.66}
$$

Thus if a pdf $f(x)$ is multiplied by a differential $\Delta x$, it can be
interpreted as approximately the probability of being within $\Delta x$ of
$x$.
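The approximation $P([x, x + \Delta x)) \approx f(x)\Delta x$ can be illustrated with a pdf whose interval probabilities are known in closed form. The sketch below uses the exponential pdf $f(r) = \lambda e^{-\lambda r}$, $r \ge 0$, as an assumed example; the function names and parameter values are illustrative only.

```python
from math import exp, expm1


def exponential_pdf(r, lam):
    """Exponential pdf lam * exp(-lam r) for r >= 0, and 0 otherwise."""
    return lam * exp(-lam * r) if r >= 0 else 0.0


def interval_probability(x, dx, lam):
    """Exact P([x, x + dx)) for x >= 0: exp(-lam x) - exp(-lam (x + dx))."""
    # -expm1(-t) equals 1 - exp(-t) and stays accurate when t is tiny.
    return exp(-lam * x) * -expm1(-lam * dx)


def density_approximation(x, dx, lam):
    """The approximation f(x) * dx suggested by the mean value theorem."""
    return exponential_pdf(x, lam) * dx


if __name__ == "__main__":
    lam, x = 2.0, 1.0  # illustrative values only
    for dx in (1e-1, 1e-2, 1e-3, 1e-4):
        exact = interval_probability(x, dx, lam)
        approx = density_approximation(x, dx, lam)
        # The ratio approaches 1 as dx shrinks.
        print(dx, exact, approx, approx / exact)
    # A pdf value can exceed one even though every probability is at most one.
    print(exponential_pdf(0.0, lam))
```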

Both probability functions, the pmf and the pdf, can be used to 
define and compute a probability measure: The pmf is summed over 
all points in the event, and the pdf is integrated over all points in 
the event. If the sample space is a subset of the real line, both can 
be used to compute expectations such as moments. 

Some of the most common pdf’s are listed below. As will be seen,
these are indeed valid pdf’s, that is, they satisfy (2.54) and (2.55).
The pdf’s are assumed to be 0 outside of the specified domain;
$b$, $a$, $\lambda > 0$, $m$, and $\sigma > 0$ are parameters in $\Re$.

The uniform pdf: Given $b > a$, $f(r) = 1/(b - a)$ for $r \in [a, b]$.

The exponential pdf: $f(r) = \lambda e^{-\lambda r}$; $r \ge 0$.

The doubly exponential (or Laplacian) pdf:

$f(r) = \frac{\lambda}{2} e^{-\lambda |r|}$; $r \in \Re$.

The Gaussian (or Normal) pdf:

$f(r) = (2\pi\sigma^2)^{-1/2} \exp\left(-\frac{(r - m)^2}{2\sigma^2}\right)$; $r \in \Re$. Since the density is
completely described by two parameters, the mean $m$ and variance
$\sigma^2 > 0$, it is common to denote it by $\mathcal{N}(m, \sigma^2)$.

Other univariate pdf’s may be found in appendix C.

Just as we used a pdf to construct a probability measure on the
space $(\Re, \mathcal{B}(\Re))$, we can also use it to define a probability measure
on any smaller space $(A, \mathcal{B}(A))$, where $A$ is a subset of $\Re$.

As a technical detail we note that to ensure that the integrals all
behave as expected we must also require that $A$ itself be a Borel
set of $\Re$ so that it is precluded from being too nasty a set. Such
probability spaces can be considered to have a sample space of either
$\Re$ or $A$, as convenient. In the former case events outside of $A$ will
have zero probability.



Computational Examples 

This section is less detailed than its counterpart for discrete prob- 
ability because generally engineers are more familiar with common 
integrals than with common sums. We confine the discussion to a 
few observations and to an example of a multidimensional probability 
computation. 

The uniform pdf is trivially a valid pdf because it is nonnegative 
and its integral is simply the length of the interval on which it is 
nonzero, b — a, divided by the length. For simplicity consider the case 
where a = 0 and b = 1 so that b — a = 1. In this case the probability 
of any interval within [0, 1) is simply the length of the interval. The 
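mean is easily found to be

$$
m = \int_0^1 r\,dr = \frac{1}{2}, \tag{2.67}
$$

the second moment is

$$
m^{(2)} = \int_0^1 r^2\,dr = \frac{1}{3}, \tag{2.68}
$$

and the variance is

$$
\sigma^2 = \frac{1}{3} - \left(\frac{1}{2}\right)^2 = \frac{1}{12}. \tag{2.69}
$$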



The validation of the pdf and the mean, second moment, and variance
of the exponential pdf can be found from integral tables or by
the integral analog to the corresponding computations for the
geometric pmf, as described in appendix B. In particular, it follows from
(B.9) that

$$
\int_0^{\infty} \lambda e^{-\lambda r}\,dr = 1, \tag{2.70}
$$

from (B.10) that

$$
m = \int_0^{\infty} r \lambda e^{-\lambda r}\,dr = \frac{1}{\lambda} \tag{2.71}
$$

and

$$
m^{(2)} = \int_0^{\infty} r^2 \lambda e^{-\lambda r}\,dr = \frac{2}{\lambda^2}, \tag{2.72}
$$

and hence from (2.65)

$$
\sigma^2 = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}. \tag{2.73}
$$



The moments can also be found by integration by parts. 
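For readers who prefer a numerical check of the exponential moments, the integrals can be approximated by Riemann sums. The following sketch uses a midpoint rule on a truncated interval; the function names, the step count, and the cutoff (a multiple of the mean $1/\lambda$) are assumptions chosen for illustration, and the truncation discards only a negligible tail.

```python
from math import exp


def midpoint_integral(func, lower, upper, steps=100_000):
    """Midpoint-rule Riemann sum approximating the integral of func on [lower, upper]."""
    width = (upper - lower) / steps
    return width * sum(func(lower + (i + 0.5) * width) for i in range(steps))


def exponential_moment(order, lam, cutoff_in_means=50.0):
    """Approximate the integral of r^order * lam * exp(-lam r) over [0, cutoff / lam]."""
    upper = cutoff_in_means / lam
    return midpoint_integral(lambda r: r**order * lam * exp(-lam * r), 0.0, upper)


if __name__ == "__main__":
    lam = 2.0  # illustrative value only
    total = exponential_moment(0, lam)   # compare with 1
    mean = exponential_moment(1, lam)    # compare with 1 / lam
    second = exponential_moment(2, lam)  # compare with 2 / lam^2
    print(total, mean, second, second - mean**2)
```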

The Laplacian pdf is simply a mixture of an exponential pdf and 
its reverse, so its properties follow from those of an exponential pdf. 
The details are left as an exercise. 

The Gaussian pdf example is more involved. In appendix B, it is
shown (in the development leading up to (B.13)) that

$$
\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - m)^2}{2\sigma^2}}\,dx = 1. \tag{2.74}
$$



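It is reasonably easy to find the mean by inspection. The function
$g(x) = (x - m)\,e^{-\frac{(x - m)^2}{2\sigma^2}}$ is an odd function of $x - m$, i.e., it
satisfies $g(m - u) = -g(m + u)$, and hence its integral is 0 if the
integral exists at all.

Writing $x = (x - m) + m$ and using (2.74) for the second term, this
means that

$$
\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}\, x\, e^{-\frac{(x - m)^2}{2\sigma^2}}\,dx = m. \tag{2.75}
$$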



The second moment and variance are most easily handled by the
transform methods to be developed in Chapter 4. Their evaluation
will be deferred until then, but we observe that the parameter $\sigma^2$
which we have called the variance is in fact the variance, i.e.,

$$
\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}\,(x - m)^2\, e^{-\frac{(x - m)^2}{2\sigma^2}}\,dx = \sigma^2. \tag{2.76}
$$



Computing probabilities with the various pdf’s varies in difficulty.
For simple pdf’s one can easily find the probabilities of simple sets like
intervals. For example, with a uniform pdf on $[a, b]$, for any
$a \le c < d \le b$, $\Pr([c, d]) = (d - c)/(b - a)$; the probability of an interval
is proportional to the length of the interval. For the exponential pdf,
the probability of an interval $[c, d]$, $0 \le c < d$, is given by

$$
\Pr([c, d]) = \int_{c}^{d} \lambda e^{-\lambda x}\,dx = e^{-\lambda c} - e^{-\lambda d}. \tag{2.77}
$$



The Gaussian pdf does not yield nice closed form solutions for the
probabilities of simple sets like intervals, but it is well tabulated.
Unfortunately there are several variations in table construction. The
most common forms are the $\Phi$ function

$$
\Phi(\alpha) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\alpha} e^{-\frac{u^2}{2}}\,du, \tag{2.78}
$$



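which is the probability of the simple event $(-\infty, \alpha] = \{x : x \le \alpha\}$
for a zero mean unit variance Gaussian pdf $\mathcal{N}(0, 1)$. The $Q$ function
is the complementary function

$$
Q(\alpha) = \frac{1}{\sqrt{2\pi}} \int_{\alpha}^{\infty} e^{-\frac{u^2}{2}}\,du = 1 - \Phi(\alpha). \tag{2.79}
$$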



The $Q$ function is used primarily in communications systems analysis
where probabilities of exceeding a threshold describe error events in
detection systems. The error function is defined by

$$
\operatorname{erf}(\alpha) = \frac{2}{\sqrt{\pi}} \int_0^{\alpha} e^{-u^2}\,du \tag{2.80}
$$

and it is related to the $Q$ and $\Phi$ functions by

$$
Q(\alpha) = \frac{1}{2}\left(1 - \operatorname{erf}\left(\frac{\alpha}{\sqrt{2}}\right)\right) = 1 - \Phi(\alpha). \tag{2.81}
$$



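Thus, for example, the probability of the set $(-\infty, \alpha]$ for a
$\mathcal{N}(m, \sigma^2)$ pdf is found by changing variables $u = (x - m)/\sigma$ to be

$$
\begin{aligned}
P(\{x : x \le \alpha\}) &= \int_{-\infty}^{\alpha} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - m)^2}{2\sigma^2}}\,dx \\
  &= \int_{-\infty}^{\frac{\alpha - m}{\sigma}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\,du \\
  &= \Phi\left(\frac{\alpha - m}{\sigma}\right) = 1 - Q\left(\frac{\alpha - m}{\sigma}\right).
\end{aligned} \tag{2.82}
$$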

(7 (7 

The probability of an interval $(a, b]$ is then given by

$$
P((a, b]) = P((-\infty, b]) - P((-\infty, a])
  = \Phi\left(\frac{b - m}{\sigma}\right) - \Phi\left(\frac{a - m}{\sigma}\right). \tag{2.83}
$$



Observe that the symmetry of a Gaussian density implies that

$$
1 - \Phi(a) = \Phi(-a). \tag{2.84}
$$
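In place of tables, the relations (2.81) to (2.84) can be evaluated with the error function from Python's standard `math` module. This is a minimal sketch; the names `phi`, `q_function`, and `gaussian_interval_probability` are illustrative and not from the text.

```python
from math import erf, sqrt


def phi(alpha):
    """Phi(alpha) for the N(0, 1) pdf, obtained from erf via (2.81)."""
    return 0.5 * (1.0 + erf(alpha / sqrt(2.0)))


def q_function(alpha):
    """Q(alpha) = (1 - erf(alpha / sqrt(2))) / 2 = 1 - Phi(alpha), as in (2.81)."""
    return 0.5 * (1.0 - erf(alpha / sqrt(2.0)))


def gaussian_interval_probability(a, b, m, sigma):
    """P((a, b]) for an N(m, sigma^2) pdf using (2.83)."""
    return phi((b - m) / sigma) - phi((a - m) / sigma)


if __name__ == "__main__":
    # Probability of falling within one standard deviation of the mean.
    print(gaussian_interval_probability(-1.0, 1.0, 0.0, 1.0))
    # Symmetry (2.84): the two printed values should agree.
    print(1.0 - phi(1.3), phi(-1.3))
```

For large arguments, where $Q(\alpha)$ is tiny, `math.erfc` would give better relative accuracy than subtracting `erf` from one; the form above is kept to mirror (2.81).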



As a multidimensional example of probability computation,
suppose that the sample space is $\Re^2$, the space of all pairs of real
numbers. The probability space consists of this sample space, the
corresponding Borel field, and a probability measure described by a pdf

$$
f(x, y) =
\begin{cases}
\lambda \mu\, e^{-\lambda x - \mu y}, & x \in [0, \infty),\ y \in [0, \infty) \\
0, & \text{otherwise.}
\end{cases}
$$

What is the probability of the event $F = \{(x, y) : x < y\}$? As an
interpretation, the sample points $(x, y)$ might correspond to the arrival
times of two distinct types of particle at a sensor following its
activation, say type A and type B for $x$ and $y$, respectively. Then the event
is the event that a particle of type A arrives at the sensor before one
of type B. Computation of the probability is then accomplished as

$$
P(F) = \iint_{(x, y) : (x, y) \in F} f(x, y)\,dx\,dy
     = \iint_{(x, y) : x \ge 0,\ y \ge 0,\ x < y} \lambda \mu\, e^{-\lambda x - \mu y}\,dx\,dy.
$$



This integral is a two-dimensional integral of its argument over the
indicated region. Correctly describing the limits of integration is
often the hardest part of computing probabilities. Note in particular
the inclusion of the facts that both $x$ and $y$ are nonnegative (since
otherwise the pdf is 0). The $x < y$ region for nonnegative $x$ and $y$ is
most easily envisioned as the region of the first quadrant lying above
the line $x = y$, if $x$ and $y$ correspond to the horizontal and vertical
axes, respectively. Completing the calculus:

$$
\begin{aligned}
P(F) &= \int_0^{\infty} \mu e^{-\mu y} \left( \int_0^{y} \lambda e^{-\lambda x}\,dx \right) dy
      = \int_0^{\infty} \mu e^{-\mu y} \left( 1 - e^{-\lambda y} \right) dy \\
     &= 1 - \frac{\mu}{\lambda + \mu} = \frac{\lambda}{\lambda + \mu}.
\end{aligned}
$$
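The role of the integration limits in the two-dimensional example can be seen in a direct numerical evaluation. The sketch below approximates the iterated integral with midpoint sums, integrating $x$ over $[0, y]$ for each $y$; evaluating the same integral analytically gives $\lambda/(\lambda + \mu)$, which serves as the comparison value. The function names, step count, and truncation of the $y$ range are assumptions made for illustration.

```python
from math import exp


def joint_pdf(x, y, lam, mu):
    """Joint pdf lam * mu * exp(-lam x - mu y) on the first quadrant, 0 elsewhere."""
    if x < 0 or y < 0:
        return 0.0
    return lam * mu * exp(-lam * x - mu * y)


def prob_x_before_y(lam, mu, steps=400, cutoff=40.0):
    """Approximate P({(x, y) : x < y}) by an iterated midpoint sum."""
    y_max = cutoff / mu  # the pdf mass beyond this point is negligible
    dy = y_max / steps
    total = 0.0
    for j in range(steps):
        y = (j + 0.5) * dy
        dx = y / steps  # the inner variable x runs only over [0, y]
        inner = dx * sum(joint_pdf((i + 0.5) * dx, y, lam, mu) for i in range(steps))
        total += inner * dy
    return total


if __name__ == "__main__":
    lam, mu = 1.0, 3.0  # illustrative values only
    # The two printed numbers should be close.
    print(prob_x_before_y(lam, mu), lam / (lam + mu))
```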




Mass Functions as Densities 

As in systems theory, discrete problems can be considered as
continuous problems with the aid of the Dirac delta or unit impulse
$\delta(t)$, a generalized function or singularity function (also,
misleadingly, called a distribution) with the property that for any smooth
function $\{g(r);\ r \in \Re\}$ and any $a \in \Re$

$$
\int g(r)\,\delta(r - a)\,dr = g(a). \tag{2.85}
$$

Given a pmf $p$ defined on a subset of the real line $\Omega \subset \Re$, we can
define a pdf $f$ by

$$
f(r) = \sum_{\omega \in \Omega} p(\omega)\,\delta(r - \omega). \tag{2.86}
$$

This is indeed a pdf since $f(r) \ge 0$ and

$$
\int f(r)\,dr = \int \left( \sum_{\omega} p(\omega)\,\delta(r - \omega) \right) dr
  = \sum_{\omega} p(\omega) \int \delta(r - \omega)\,dr
  = \sum_{\omega} p(\omega) = 1.
$$

In a similar fashion, probabilities are computed as

$$
\begin{aligned}
\int 1_F(r) f(r)\,dr &= \int 1_F(r) \left( \sum_{\omega} p(\omega)\,\delta(r - \omega) \right) dr \\
  &= \sum_{\omega} p(\omega) \int 1_F(r)\,\delta(r - \omega)\,dr \\
  &= \sum_{\omega} p(\omega)\,1_F(\omega) = P(F).
\end{aligned}
$$



Given that discrete probability can be handled using the tools
of continuous probability in this fashion, it is natural to inquire
whether pdf’s should be used in both the discrete and continuous
case. The main reason for not doing so is simplicity. Pmf’s and
sums are usually simpler to handle and evaluate than pdf’s and inte- 
grals. Questions of existence and limits rarely arise, and the notation 
is simpler. In addition, the use of Dirac deltas assumes the theory 
of generalized functions in order to treat integrals involving Dirac 
deltas as if they were ordinary integrals, so additional mathematical 
machinery is required. As a result, this approach is rarely used in 
genuinely discrete problems. On the other hand, if one is dealing 
with a hybrid problem that has both discrete and continuous com- 
ponents, then this approach may make sense because it allows the 
use of a single probability function, a pdf, throughout. 



Multidimensional pdf’s 

By considering multidimensional integrals we can also extend the
construction of probabilities by integrals to finite-dimensional
product spaces, e.g., $\Re^k$.

Given the measurable space $(\Re^k, \mathcal{B}(\Re)^k)$, say we have a real-valued
function $f$ on $\Re^k$ with the properties that

$$
f(\mathbf{x}) \ge 0; \quad \text{all } \mathbf{x} = (x_0, x_1, \ldots, x_{k-1}) \in \Re^k, \tag{2.87}
$$

$$
\int_{\Re^k} f(\mathbf{x})\,d\mathbf{x} = 1. \tag{2.88}
$$

Then define a set function $P$ by

$$
P(F) = \int_{F} f(\mathbf{x})\,d\mathbf{x}, \quad \text{all } F \in \mathcal{B}(\Re)^k, \tag{2.89}
$$

where the vector integral is shorthand for the $k$-dimensional integral,
that is,

$$
P(F) = \int_{(x_0, x_1, \ldots, x_{k-1}) \in F} f(x_0, x_1, \ldots, x_{k-1})\,dx_0\,dx_1 \cdots dx_{k-1}.
$$

Note that (2.87) to (2.89) are exact vector equivalents of (2.54)
to (2.56). As with one-dimensional pdf’s, a multidimensional pdf is not
itself the probability of anything. As in the scalar case, however, the mean
value theorem of calculus can be used to interpret the pdf as being
proportional to the probability of being in a very small region around
a point, i.e., that

$$
P(\{(\alpha_0, \alpha_1, \ldots, \alpha_{k-1}) : x_i \le \alpha_i < x_i + \Delta_i;\ i = 0, 1, \ldots, k-1\})
  \approx f(x_0, x_1, \ldots, x_{k-1})\,\Delta_0 \Delta_1 \cdots \Delta_{k-1}. \tag{2.90}
$$

Is $P$ defined by (2.89) a probability measure? The answer is
a qualified yes with exactly the same qualifications as in the
one-dimensional case.

As in the one-dimensional sample space, a function $f$ with the
above properties is called a probability density function or pdf. To
be more concise we will occasionally refer to a pdf on $k$-dimensional
space as a $k$-dimensional pdf.

There are two common and important examples of $k$-dimensional
pdf’s. These are defined next. In both examples the dimension $k$ of
the sample space is fixed and the pdf’s induce a probability measure
on $(\Re^k, \mathcal{B}(\Re)^k)$ by (2.89).

[2.16] The product pdf.

Let $f_i$; $i = 0, 1, \ldots, k-1$, be a collection of one-dimensional
pdf’s; that is, $f_i(r)$; $r \in \Re$ satisfies (2.54) and (2.55) for each
$i = 0, 1, \ldots, k-1$. Define the product $k$-dimensional pdf $f$ by

$$
f(\mathbf{x}) = f(x_0, x_1, \ldots, x_{k-1}) = \prod_{i=0}^{k-1} f_i(x_i).
$$

The product pdf in $k$-dimensional space is simply the product
of $k$ pdf’s on one-dimensional space. The one-dimensional pdf’s are
called the marginal pdf’s, and the multidimensional pdf is sometimes
called a joint pdf. It is easy to verify that the product pdf integrates
to 1.

The case of greatest importance is when all of the marginal pdf’s
are identical, that is, when $f_i(r) = f_0(r)$ for all $i$. Note that any of the
previously defined pdf’s on $\Re$ yield a corresponding multidimensional
pdf by this construction. In a similar manner we can construct pmf’s
on discrete product spaces as a product of marginal pmf’s.

[2.17] The multidimensional Gaussian pdf. 

Let $\mathbf{m} = (m_0, m_1, \ldots, m_{k-1})^t$ denote a column vector (the
superscript $t$ stands for “transpose”). Let $\Lambda$ denote a $k$ by $k$ square
matrix with entries $\{\lambda_{i,j};\ i = 0, 1, \ldots, k-1;\ j = 0, 1, \ldots, k-1\}$.
Assume that $\Lambda$ is symmetric; that is, that $\Lambda^t = \Lambda$ or, equivalently,
that $\lambda_{i,j} = \lambda_{j,i}$, all $i, j$. Assume also that $\Lambda$ is positive definite;
that is, for any nonzero vector $\mathbf{y} \in \Re^k$ the quadratic form $\mathbf{y}^t \Lambda \mathbf{y}$
is positive, that is,

$$
\mathbf{y}^t \Lambda \mathbf{y} = \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} y_i \lambda_{i,j} y_j > 0.
$$

A multidimensional pdf is said to be Gaussian if it has the
following form for some vector $\mathbf{m}$ and matrix $\Lambda$ satisfying the above
conditions:

$$
f(\mathbf{x}) = (2\pi)^{-k/2} (\det \Lambda)^{-1/2}\, e^{-\frac{1}{2} (\mathbf{x} - \mathbf{m})^t \Lambda^{-1} (\mathbf{x} - \mathbf{m})}; \quad \mathbf{x} \in \Re^k,
$$

where $\det \Lambda$ is the determinant of the matrix $\Lambda$.

Since the matrix $\Lambda$ is positive definite, the inverse of $\Lambda$ exists and
hence the pdf is well defined. It is also necessary for $\Lambda$ to be positive
definite if the integral of the pdf is to be finite. The Gaussian pdf
may appear complicated, but it will later be seen to be one of the
simplest to deal with. We shall later develop the significance of the
vector $\mathbf{m}$ and matrix $\Lambda$. Note that if $\Lambda$ is a diagonal matrix, example
[2.17] reduces to a special case of example [2.16].
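The reduction of the Gaussian pdf to a product pdf for diagonal $\Lambda$ can be checked numerically in the smallest nontrivial case, $k = 2$. The following is a sketch restricted to $2 \times 2$ matrices so that the inverse and determinant can be written out by hand; the function names and test values are illustrative assumptions and not from the text.

```python
from math import exp, pi, sqrt


def gaussian_pdf_1d(r, m, var):
    """One-dimensional Gaussian pdf with mean m and variance var."""
    return exp(-((r - m) ** 2) / (2.0 * var)) / sqrt(2.0 * pi * var)


def gaussian_pdf_2d(x, m, lam):
    """Gaussian pdf for k = 2, where lam is a 2x2 symmetric positive definite matrix."""
    (a, b), (c, d) = lam
    det = a * d - b * c
    # For a symmetric 2x2 matrix, a > 0 and det > 0 is equivalent to positive definiteness.
    if abs(b - c) > 1e-12 or a <= 0 or det <= 0:
        raise ValueError("matrix must be symmetric and positive definite")
    u0, u1 = x[0] - m[0], x[1] - m[1]
    # The inverse of [[a, b], [c, d]] is (1 / det) * [[d, -b], [-c, a]].
    quad = (d * u0 * u0 - (b + c) * u0 * u1 + a * u1 * u1) / det
    # (2 pi)^(-k/2) (det)^(-1/2) with k = 2.
    return exp(-0.5 * quad) / (2.0 * pi * sqrt(det))


if __name__ == "__main__":
    x, m = (0.3, -1.2), (1.0, 0.5)  # illustrative values only
    diagonal = ((2.0, 0.0), (0.0, 0.5))
    joint = gaussian_pdf_2d(x, m, diagonal)
    marginals = gaussian_pdf_1d(x[0], m[0], 2.0) * gaussian_pdf_1d(x[1], m[1], 0.5)
    # For a diagonal matrix the joint pdf equals the product of the marginals.
    print(joint, marginals)
```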

The reader must either accept on faith that the multidimensional 
Gaussian pdf integrates to 1 or seek out a derivation. 

The Gaussian pdf can be extended to complex vectors if the
constraints on $\Lambda$ are modified to require that $\Lambda^* = \Lambda$, where the asterisk
denotes conjugate transpose, and where for any vector $\mathbf{y}$ not
identically 0 it is required that $\mathbf{y}^* \Lambda \mathbf{y} > 0$.



[2.18] Mixtures. 

Suppose that $P_i$, $i = 1, 2, \ldots$, is a collection of probability
measures on a common measurable space $(\Omega, \mathcal{F})$, and let $a_i$,
$i = 1, 2, \ldots$, be nonnegative numbers that sum to 1. Then the set
function determined by

$$P(F) = \sum_{i=1}^{\infty} a_i P_i(F)$$

is also a probability measure on $(\Omega, \mathcal{F})$. This relation is usually
abbreviated to

$$P = \sum_{i=1}^{\infty} a_i P_i.$$

The first two axioms are obviously satisfied by $P$, and countable
additivity follows from the properties of sums. (Finite additivity is
easily demonstrated for the case of a finite number of nonzero $a_i$.) A
probability measure formed in this way is called a mixture. Observe
that this construction can be used to form a probability measure
with both discrete and continuous aspects. For example, let $\Omega$ be
the real line and $\mathcal{F}$ the Borel field; suppose that $f$ is a pdf and $p$ is
a pmf; then for any $\lambda \in (0, 1)$ the measure $P$ defined by

$$P(F) = \lambda \sum_{x \in F} p(x) + (1 - \lambda) \int_F f(x)\,dx$$

combines a discrete portion described by $p$ and a continuous
portion described by $f$. Expectations can be computed in a similar way.
Given a function $g$,

$$E(g) = \lambda \sum_{x \in \Omega} g(x) p(x) + (1 - \lambda) \int_{\Omega} g(x) f(x)\,dx.$$



Note that this construction works for both scalar and vector spaces. 
This combination of discrete and continuous attributes is one of the 
main applications of mixtures. Another is in modeling a random 
process where there is some uncertainty about the parameters of 
the experiment. For example, consider a probability space for the 
following experiment: First a fair coin is flipped and a 0 or 1 (tail or 
head) observed. If the coin toss results in a 1, then a fair die described
by a uniform pmf $p_1$ is rolled, and the outcome is the result of the
experiment. If the coin toss results in a 0, then a biased die described
by a nonuniform pmf $p_2$ is rolled, and the outcome is the result of the
experiment. The pmf of the overall experiment is then the mixture
$p_1/2 + p_2/2$. The mixture model captures our ignorance of which die
we will be rolling.
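The coin-and-die experiment can be worked out exactly in a few lines. This is only a sketch: the text does not specify the biased pmf, so the values used for it below are a hypothetical choice, and the helper name `mixture` is likewise an assumption made for illustration.

```python
from fractions import Fraction

# Fair die: uniform pmf on {1, ..., 6}.
fair = {k: Fraction(1, 6) for k in range(1, 7)}

# Hypothetical biased die (the text does not give its pmf); it favors a six.
biased = {1: Fraction(1, 12), 2: Fraction(1, 12), 3: Fraction(1, 12),
          4: Fraction(1, 12), 5: Fraction(1, 6), 6: Fraction(1, 2)}

def mixture(pmfs, weights):
    """Return the pmf sum_i a_i p_i for nonnegative weights a_i summing to 1."""
    if any(w < 0 for w in weights) or sum(weights) != 1:
        raise ValueError("weights must be nonnegative and sum to 1")
    outcomes = sorted(set().union(*pmfs))
    return {x: sum(w * p.get(x, 0) for w, p in zip(weights, pmfs))
            for x in outcomes}

# A fair coin chooses the die, so both weights are 1/2.
mix = mixture([fair, biased], [Fraction(1, 2), Fraction(1, 2)])
for outcome, probability in mix.items():
    print(outcome, probability)
```

Exact rational arithmetic is used so that the check that the mixture sums to one is not blurred by rounding.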



2.6 Independence 

Given a probability space $(\Omega, \mathcal{F}, P)$, two events $F$ and $G$ are defined
to be independent if $P(F \cap G) = P(F)P(G)$. A collection of events
$\{F_i;\ i = 0, 1, \ldots, k-1\}$ is said to be independent or mutually
independent if for any distinct subcollection $\{F_{l_i};\ i = 0, 1, \ldots, m-1\}$,
$m \le k$, we have that

$$P\left(\bigcap_{i=0}^{m-1} F_{l_i}\right) = \prod_{i=0}^{m-1} P(F_{l_i}).$$

In words: the probability of the intersection of any subcollection
of the given events equals the product of the probabilities of the
separate events. Unfortunately it is not enough to simply require
that $P\left(\bigcap_{i=0}^{k-1} F_i\right) = \prod_{i=0}^{k-1} P(F_i)$ as this does not imply a
similar result for all possible subcollections of events, which is what
will be needed. For example, consider the following case where
$P(F \cap G \cap H) = P(F)P(G)P(H)$ for three events $F$, $G$, and $H$, yet
it is not true that $P(F \cap G) = P(F)P(G)$:

$$P(F) = P(G) = P(H) = \frac{1}{3}$$

$$P(F \cap G \cap H) = \frac{1}{27} = P(F)P(G)P(H)$$

$$P(F \cap G) = P(G \cap H) = P(F \cap H) = \frac{1}{27} \ne P(F)P(G).$$

The example places zero probability on the overlap $F \cap G$ except
where it also overlaps $H$, i.e., $P(F \cap G \cap H^c) = 0$. Thus in this case
$P(F \cap G \cap H) = P(F)P(G)P(H) = 1/27$, but $P(F \cap G) = 1/27 \ne
P(F)P(G) = 1/9$.
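One concrete way to realize this counterexample is with a five-point sample space. The sketch below is an assumed construction (the atom names and the helper `mutually_independent` are hypothetical); it only illustrates that the triple product condition can hold while the pairwise condition fails.

```python
from fractions import Fraction
from itertools import combinations

# Atoms of an assumed sample space. All overlap among F, G, and H is placed
# on the single atom "fgh", so every pairwise intersection equals the triple one.
pmf = {
    "fgh": Fraction(1, 27),   # the only point lying in more than one event
    "f": Fraction(8, 27),     # in F only
    "g": Fraction(8, 27),     # in G only
    "h": Fraction(8, 27),     # in H only
    "none": Fraction(2, 27),  # in none of the events
}
F = {"fgh", "f"}
G = {"fgh", "g"}
H = {"fgh", "h"}

def prob(event):
    """Probability of an event (a set of atoms) under pmf."""
    return sum(pmf[w] for w in event)

def mutually_independent(events):
    """Check the product rule on every subcollection of two or more events."""
    for m in range(2, len(events) + 1):
        for sub in combinations(events, m):
            product = Fraction(1)
            for e in sub:
                product *= prob(e)
            if prob(set.intersection(*sub)) != product:
                return False
    return True

print(prob(F & G & H), prob(F) * prob(G) * prob(H))  # equal
print(prob(F & G), prob(F) * prob(G))                # not equal
print(mutually_independent([F, G, H]))
```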

The concept of independence in the probabilistic sense we have de- 
fined relates easily to the intuitive idea of independence of physical 
events. For example, if a fair die is rolled twice, one would expect 
the second roll to be unrelated to the first roll because there is no 
physical connection between the individual outcomes. Independence 
in the probabilistic sense is reflected in this experiment. The prob- 
ability of any given outcome for either of the individual rolls is 1/6. 
The probability of any given pair of outcomes is $(1/6)^2 = 1/36$;
the addition of a second outcome diminishes the overall probability
by exactly the probability of the individual event, viz., 1/6. Note
that the probabilities are not added: the probability of two
successive outcomes cannot reasonably be greater than the probability of
either of the outcomes alone. Do not, however, confuse the concept
of independence with the concept of disjoint or mutually exclusive
events. If you roll the die once, the event “the roll is a one” is not
independent of the event “the roll is a six.” Given one event, the other
cannot happen; they are neither physically nor probabilistically
independent. They are mutually exclusive events.
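The distinction between independent and mutually exclusive events can be checked by enumeration. The sketch below assumes the natural model of two rolls of a fair die, a uniform pmf on the 36 ordered pairs; the event names are chosen here for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space for two rolls of a fair die: 36 equally likely ordered pairs.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event under the uniform pmf on omega."""
    return Fraction(len(event), len(omega))

first_is_3 = {w for w in omega if w[0] == 3}
second_is_5 = {w for w in omega if w[1] == 5}
first_is_1 = {w for w in omega if w[0] == 1}
first_is_6 = {w for w in omega if w[0] == 6}

# Events on different rolls: the product rule holds.
print(prob(first_is_3 & second_is_5), prob(first_is_3) * prob(second_is_5))

# Disjoint events on the same roll: the intersection is empty, so its
# probability is 0, while the product of the probabilities is 1/36.
print(prob(first_is_1 & first_is_6), prob(first_is_1) * prob(first_is_6))
```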






2.7 Elementary Conditional Probability 

Intuitively, independence of two events means that the occurrence of 
one event should not affect the occurrence of the other. For exam- 
ple, the knowledge of the outcome of the first roll of a die should not 
change the probabilities for the outcome of the second roll of the die
if the die has no memory. To be more precise, the notion of
conditional probability is required. Consider the following motivation.
Suppose that $(\Omega, \mathcal{F}, P)$ is a probability space and that an observer
is told that an event $G$ has already occurred. The observer thus
has a posteriori knowledge of the experiment. The observer is then
asked to calculate the probability of another event $F$ given this
information. We will denote this probability of $F$ given $G$ by $P(F|G)$.
Thus instead of the a priori or unconditional probability $P(F)$, the
observer must compute the a posteriori or conditional probability
$P(F|G)$, read as “the probability that event $F$ occurs given that the
event $G$ occurred.” For a fixed $G$ the observer should be able to find
$P(F|G)$ for all events $F$, thus the observer is in fact being asked to
describe a new probability measure, say $P_G$, on $(\Omega, \mathcal{F})$. How should
this be defined? Intuition will lead to a useful definition and this
definition will provide a useful interpretation of independence.

First, since the observer has been told that $G$ has occurred and
hence $\omega \in G$, clearly the new probability measure $P_G$ must assign
zero probability to the set of all $\omega$ outside of $G$, that is, we should
have

$$P(G^c|G) = 0 \qquad (2.91)$$

or, equivalently,

$$P(G|G) = 1. \qquad (2.92)$$

Eq. (2.91) plus the axioms of probability in turn imply that

$$P(F|G) = P(F \cap (G \cup G^c)|G) = P(F \cap G|G). \qquad (2.93)$$



Second, there is no reason to suspect that the relative probabilities
within $G$ should change because of the conditioning. For example, if
an event $F \subset G$ is twice as probable as an event $H \subset G$ with respect
to $P$, then the same should be true with respect to $P_G$. For arbitrary
events $F$ and $H$, the events $F \cap G$ and $H \cap G$ are both in $G$, and
hence this preservation of relative probability implies that

$$\frac{P(F \cap G|G)}{P(H \cap G|G)} = \frac{P(F \cap G)}{P(H \cap G)}.$$

But if we take $H = \Omega$ in this formula and use (2.92)-(2.93), we have
that

$$P(F|G) = P(F \cap G|G) = \frac{P(F \cap G)}{P(G)}, \qquad (2.94)$$

which is in fact the formula we now use to define the conditional 
probability of the event F given the event G. The conditional prob- 
ability can be interpreted as “cutting down” the original probability 
space to a probability space with the smaller sample space G and 
with probabilities equal to the renormalized probabilities of the in- 
tersection of events with the given event G on the original space. 

This definition meets the intuitive requirements of the derivation,
but does it make sense and does it fulfill the original goal of
providing an interpretation for independence? It does make sense provided
$P(G) > 0$, that is, the conditioning event does not have zero
probability. This is in fact the distinguishing requirement that makes the
above definition work for what is known as elementary conditional
probability. Non-elementary conditional probability will provide a
more general definition that will work for conditioning events having
zero probability, such as the event that a fair spin of a pointer results
in a reading of exactly $1/\pi$. Further, if $P$ is a probability measure,
then it is easy to see that $P_G$ defined by $P_G(F) = P(F|G)$ for $F \in \mathcal{F}$
is also a probability measure on the same space (remember $G$ stays
fixed), i.e., $P_G$ is a normalized and countably additive function of
events. As to independence, suppose that $F$ and $G$ are independent
events and that $P(G) > 0$, then



$$P(F|G) = \frac{P(F \cap G)}{P(G)} = P(F);$$

that is, the probability of $F$ is not affected by the knowledge that $G$ has
occurred. This is exactly what one would expect from the
intuitive notion of the independence of two events. Note, however, that
it would not be as useful to define independence of two events by
requiring $P(F) = P(F|G)$ since it would be less general than the
product definition; it requires that one of the events have a nonzero
probability.
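The definition of elementary conditional probability is easy to exercise on a finite space. The following sketch assumes a single roll of a fair die; the events and the function name `cond_prob` are illustrative choices. It checks that the conditional measure is normalized, that it vanishes outside the conditioning event, and that conditioning leaves the probability unchanged exactly when the two events are independent.

```python
from fractions import Fraction

# One roll of a fair die.
pmf = {w: Fraction(1, 6) for w in range(1, 7)}
omega = set(pmf)

def prob(event):
    """Probability of an event (a set of outcomes)."""
    return sum(pmf[w] for w in event)

def cond_prob(F, G):
    """Elementary conditional probability P(F|G); requires P(G) > 0."""
    pg = prob(G)
    if pg == 0:
        raise ValueError("conditioning event must have nonzero probability")
    return prob(F & G) / pg

even = {2, 4, 6}
at_most_4 = {1, 2, 3, 4}   # independent of `even`: P = 2/3 and 1/3 = (1/2)(2/3)
at_least_4 = {4, 5, 6}     # not independent of `even`

print(cond_prob(omega, at_most_4))              # normalization: 1
print(cond_prob(omega - at_most_4, at_most_4))  # outside of G: 0
print(cond_prob(even, at_most_4), prob(even))   # equal, events independent
print(cond_prob(even, at_least_4), prob(even))  # differ, events dependent
```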

Conditional probability provides a means of constructing new
probability spaces from old ones by using conditional pmf’s and
elementary conditional pdf’s.

[2.18] Suppose that $(\Omega, \mathcal{F}, P)$ is a probability space described by
a pmf $p$ and that $A$ is an event with nonzero probability. Then
the pmf $p_A$ defined by

$$p_A(\omega) = \begin{cases} \dfrac{p(\omega)}{P(A)} & \omega \in A \\ 0 & \omega \notin A \end{cases}$$

is a pmf and implies a probability space $(\Omega, \mathcal{F}, P_A)$, where

$$P_A(F) = \sum_{\omega \in F} p_A(\omega) = P(F|A).$$

$p_A$ is called a conditional pmf. More specifically, it is the
conditional pmf given the event $A$. In some cases it may be more
convenient to define the conditional pmf on the sample space $A$
and hence the conditional probability measure on the original
event space.

As an example, suppose that $p$ is a geometric pmf and that $A =
\{\omega : \omega \ge K\} = \{K, K+1, \ldots\}$. In this case the conditional pmf
given that the outcome is greater than or equal to $K$ is

$$p_A(k) = \frac{(1-p)^{k-1} p}{\sum_{l=K}^{\infty} (1-p)^{l-1} p} = \frac{(1-p)^{k-1} p}{(1-p)^{K-1}} = (1-p)^{k-K} p; \quad k = K, K+1, \ldots, \qquad (2.95)$$

which can be recognized as a geometric pmf which begins at $k = K$.
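The conditional geometric pmf can be verified with exact arithmetic. The sketch below assumes the geometric pmf $p(k) = (1-p)^{k-1}p$ on $k = 1, 2, \ldots$ and the event $A = \{K, K+1, \ldots\}$; the parameter values are arbitrary illustrative choices, and $P(A)$ is obtained from the finite complement so that no infinite sum is needed.

```python
from fractions import Fraction

def geometric_pmf(k, p):
    """Geometric pmf (1 - p)^(k - 1) p on k = 1, 2, ...; zero elsewhere."""
    return (1 - p) ** (k - 1) * p if k >= 1 else Fraction(0)

def conditional_pmf(k, p, K):
    """Conditional pmf given A = {K, K+1, ...}: p(k)/P(A) on A, 0 off A."""
    # P(A) computed exactly as 1 - P({1, ..., K-1}).
    prob_A = 1 - sum(geometric_pmf(j, p) for j in range(1, K))
    return geometric_pmf(k, p) / prob_A if k >= K else Fraction(0)

p, K = Fraction(1, 4), 3   # assumed values for illustration
for k in range(K, K + 5):
    # Each conditional value equals (1 - p)^(k - K) p, i.e. the original
    # pmf shifted so that it starts at the first point of A.
    print(k, conditional_pmf(k, p, K), (1 - p) ** (k - K) * p)
```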

[2.19] Suppose that $(\Omega, \mathcal{F}, P)$ is a probability space described by
a pdf $f$ and that $A$ is an event with nonzero probability. Then
the $f_A$ defined by

$$f_A(\omega) = \begin{cases} \dfrac{f(\omega)}{P(A)} & \omega \in A \\ 0 & \omega \notin A \end{cases}$$

is a pdf on $A$ and describes a probability measure

$$P_A(F) = \int_{\omega \in F} f_A(\omega)\,d\omega = P(F|A). \qquad (2.96)$$

$f_A$ is called an elementary conditional pdf (given the event $A$).
The word “elementary” reflects the fact that the conditioning
event has nonzero probability. We will later see how conditional
probability can be usefully extended to conditioning on events of
zero probability and why they are important.

As a simple example, consider the continuous analog of the
previous conditional geometric pmf example. Given an exponential pdf
and $A = \{r : r \ge c\}$, define

$$f_A(x) = \frac{\lambda e^{-\lambda x}}{\int_c^{\infty} \lambda e^{-\lambda y}\,dy} = \frac{\lambda e^{-\lambda x}}{e^{-\lambda c}} = \lambda e^{-\lambda (x - c)}; \quad x \ge c, \qquad (2.97)$$

which can be recognized as an exponential pdf that starts at $c$. The
exponential pdf and geometric pmf share the unusual property that
conditioning on the output being larger than some number does not
change the basic form of the pdf or pmf, only its starting point. This
has the discouraging implication that if, for example, the time for
the next arrival of a bus is described by an exponential pdf, then
knowing you have already waited for an hour does not change your
pdf to the next arrival from what it was when you arrived.
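The bus example can be checked numerically using the closed form of the exponential tail probability, $e^{-\lambda x}$ for $x \ge 0$. The sketch below uses hypothetical function names and arbitrary parameter values; it compares the conditional pdf given a wait of at least $c$ with the shifted exponential pdf, and the conditional tail with the unconditional one.

```python
import math

def exp_pdf(x, lam):
    """Exponential pdf lam * exp(-lam * x) on x >= 0; zero elsewhere."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def tail(x, lam):
    """P({r : r >= x}) under the exponential pdf."""
    return math.exp(-lam * x) if x > 0 else 1.0

def conditional_pdf(x, c, lam):
    """Elementary conditional pdf given A = {r : r >= c}."""
    return exp_pdf(x, lam) / tail(c, lam) if x >= c else 0.0

def conditional_tail(t, c, lam):
    """Probability of waiting at least t more, given a wait of at least c."""
    return tail(c + t, lam) / tail(c, lam)

lam, c = 2.0, 1.0   # assumed values for illustration
for x in (1.0, 1.5, 3.0):
    # Conditional pdf versus the exponential pdf restarted at c.
    print(x, conditional_pdf(x, c, lam), exp_pdf(x - c, lam))
print(conditional_tail(0.5, c, lam), tail(0.5, lam))
```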



2.8 Problems 

1. Suppose that you have a set function $P$ defined for all subsets $F \subset \Omega$ of a
sample space $\Omega$ and suppose that you know that this set function satisfies
(2.7)-(2.9). Show that for arbitrary (not necessarily disjoint) events,

$$P(F \cup G) = P(F) + P(G) - P(F \cap G).$$

2. Given a probability space $(\Omega, \mathcal{F}, P)$, let $F$, $G$, and $H$ be events such
that $P(F \cap G|H) = 1$. Which of the following statements are true? Why
or why not?

(a) $P(F \cap G) = 1$

(b) $P(F \cap G \cap H) = P(H)$

(c) $P(F^c|H) = 0$

(d) $H = \Omega$

3. Describe the sigma-field of subsets of $\Re$ generated by the points or
singleton sets. Does this sigma-field contain intervals of the form $(a, b)$ for
$b > a$?

4. Given a finite subset $A$ of the real line $\Re$, prove that the power set of $A$
and $\mathcal{B}(A)$ are the same. Repeat for a countably infinite subset of $\Re$.

5. Given that the discrete sample space $\Omega$ has $n$ elements, show that the
power set of $\Omega$ consists of $2^n$ elements.

6. *Let $\Omega = \Re$, the real line, and consider the collection $\mathcal{F}$ of subsets of $\Re$
defined as all sets of the form

$$\bigcup_{i=0}^{k} (a_i, b_i] \cup \bigcup_{j=0}^{m} (c_j, d_j]^c$$

for all possible choices of nonnegative integers $k$ and $m$ and all possible
choices of real numbers $a_i < b_i$, $c_i < d_i$. If $k$ or $m$ is 0, then the respective
unions are defined to be empty so that the empty set itself has the form
given. In other words, $\mathcal{F}$ contains all possible finite unions of half-open
intervals of this form and complements of such half-open intervals. Every
set of this form is in $\mathcal{F}$ and every set in $\mathcal{F}$ has this form. Prove that $\mathcal{F}$
is a field of subsets of $\Omega$. Does $\mathcal{F}$ contain the points? For example, is the
singleton set $\{0\}$ in $\mathcal{F}$? Is $\mathcal{F}$ a sigma-field?

7. Let $\Omega = [0, \infty)$ be a sample space and let $\mathcal{F}$ be the sigma-field of subsets
of $\Omega$ generated by all sets of the form $(n, n+1)$ for $n = 1, 2, \ldots$

(a) Are the following subsets of $\Omega$ in $\mathcal{F}$? (i) $[0, \infty)$, (ii) $Z_+ = \{0, 1, 2, \ldots\}$,
(iii) $[0, k] \cup [k+1, \infty)$ for any positive integer $k$, (iv) $\{k\}$ for any
positive integer $k$, (v) $[0, k]$ for any positive integer $k$, (vi) $(1/3, 2)$.

(b) Define the following set function on subsets of $\Omega$:

$$P(F) = c \sum_{i \in Z_+ :\ i + 1/2 \in F} 3^{-i}.$$

(If there is no $i$ for which $i + 1/2 \in F$, then the sum is taken as zero.)
Is $P$ a probability measure on $(\Omega, \mathcal{F})$ for an appropriate choice of $c$?
If so, what is $c$?

(c) Repeat part (b) with $\mathcal{B}$, the Borel field, replacing $\mathcal{F}$ as the event
space.

(d) Repeat part (b) with the power set of $[0, \infty)$ replacing $\mathcal{F}$ as the event
space.

(e) Find $P(F)$ for the sets $F$ considered in part (a).

8. Show that an equivalent axiom to 2.3 of probability is the following:

If $F$ and $G$ are disjoint, then $P(F \cup G) = P(F) + P(G)$,




that is, we really need only specify finite additivity for the special case 
of n = 2. 

9. Consider the measurable space $([0, 1], \mathcal{B}([0, 1]))$. Define a set function $P$
on this space as follows:

$$P(F) = \begin{cases} 1/2 & \text{if } 0 \in F \text{ or } 1 \in F \text{ but not both} \\ 1 & \text{if } 0 \in F \text{ and } 1 \in F \\ 0 & \text{otherwise.} \end{cases}$$

Is $P$ a probability measure?

10. Let $S$ be a sphere in $\Re^3$: $S = \{(x, y, z) : x^2 + y^2 + z^2 < r^2\}$, where $r$ is
a fixed radius. In the sphere are fixed $N$ molecules of gas, each molecule
being considered as an infinitesimal volume (that is, it occupies only a
point in space). Define for any subset $F$ of $S$ the function

$$\#(F) = \{\text{the number of molecules in } F\}.$$

Show that $P(F) = \#(F)/N$ is a probability measure on the measurable
space consisting of $S$ and its power set.



11. *Suppose that you are given a probability space $(\Omega, \mathcal{F}, P)$ and that a
collection $\mathcal{F}_P$ of subsets of $\Omega$ is defined by

$$\mathcal{F}_P = \{F \cup N;\ \text{all } F \in \mathcal{F},\ \text{all } N \subset G \text{ for which } G \in \mathcal{F} \text{ and } P(G) = 0\}. \qquad (2.98)$$

In words: $\mathcal{F}_P$ contains every event in $\mathcal{F}$ along with every subset $N$ which
is a subset of a zero probability event $G \in \mathcal{F}$, whether or not $N$ is itself
an event (a member of $\mathcal{F}$). Thus $\mathcal{F}_P$ is formed by adding any sets not
already in $\mathcal{F}$ which happen to be subsets of zero probability events. We
can define a set function $\bar{P}$ for the measurable space $(\Omega, \mathcal{F}_P)$ by

$$\bar{P}(F \cup N) = P(F) \text{ if } F \in \mathcal{F} \text{ and } N \subset G \in \mathcal{F}, \text{ where } P(G) = 0. \qquad (2.99)$$

Show that $(\Omega, \mathcal{F}_P, \bar{P})$ is a probability space, i.e., you must show that $\mathcal{F}_P$
is an event space and that $\bar{P}$ is a probability measure. A probability
space with the property that all subsets of zero probability events are
also events is said to be complete and the probability space $(\Omega, \mathcal{F}_P, \bar{P})$
is called the completion of the probability space $(\Omega, \mathcal{F}, P)$.

In problems 2.12 to 2.18 let $(\Omega, \mathcal{F}, P)$ be a probability space and
assume that all given sets are events.
12. If $G \subset F$, prove that $P(F - G) = P(F) - P(G)$. Use this fact to prove
that if $G \subset F$, then $P(G) \le P(F)$.

13. Let $\{F_i\}$ be a countable partition of a set $G$. Prove that for any event
$H$,

$$\sum_i P(H \cap F_i) = P(H \cap G).$$

14. If $\{F_i;\ i = 1, 2, \ldots\}$ forms a partition of $\Omega$ and $\{G_i;\ i = 1, 2, \ldots\}$ forms
a partition of $\Omega$, prove that for any $H$,

$$P(H) = \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} P(H \cap F_i \cap G_j).$$

15. Prove that $|P(F) - P(G)| \le P(F \Delta G)$.
In words: If the probability of the symmetric difference of two events is
small, then the two events must have approximately the same probability.

16. Prove that $P(F \cup G) \le P(F) + P(G)$. Prove more generally that for any
sequence (i.e., countable collection) of events $F_i$,

$$P\left(\bigcup_{i=1}^{\infty} F_i\right) \le \sum_{i=1}^{\infty} P(F_i).$$

This inequality is called the union bound or the Bonferroni inequality.
(Hint: Use problem A.2 or 2.1.)

17. Prove that for any events $F$, $G$, and $H$,

$$P(F \Delta G) \le P(F \Delta H) + P(H \Delta G).$$

The astute observer may recognize this as a form of the triangle
inequality; one can consider $P(F \Delta G)$ as a distance or metric on events.

18. Prove that if $P(F) > 1 - \delta$ and $P(G) > 1 - \delta$, then also $P(F \cap G) >
1 - 2\delta$. In other words, if two events have probability nearly one, then
their intersection has probability nearly one.

19. *The Cantor set. Consider the probability space $(\Omega, \mathcal{B}(\Omega), P)$ where $P$ is
described by a uniform pdf on $\Omega = [0, 1)$. Let $F_1 = (1/3, 2/3)$, the
middle third of the sample space. Form the set $G_1 = \Omega - F_1$ by removing
the middle third of the unit interval. Next define $F_2$ as the union of the
middle thirds of all of the intervals in $G_1$, i.e., $F_2 = (1/9, 2/9) \cup (7/9, 8/9)$.
Define $G_2$ as what remains when we remove $F_2$ from $G_1$, that is,

$$G_2 = G_1 - F_2 = [0, 1] - (F_1 \cup F_2).$$

Continue in this manner. At stage $n$, $F_n$ is the union of the middle
thirds of all of the intervals in $G_{n-1} = [0, 1] - \bigcup_{k=1}^{n-1} F_k$. The Cantor set
is defined as the limit of the $G_n$, that is,

$$C = \bigcap_{n=1}^{\infty} G_n = [0, 1] - \bigcup_{n=1}^{\infty} F_n. \qquad (2.100)$$

(a) Prove that $C \in \mathcal{B}(\Omega)$, i.e., that it is an event.

(b) Prove that

$$P(F_n) = \frac{1}{3} \left(\frac{2}{3}\right)^{n-1}; \quad n = 1, 2, \ldots \qquad (2.101)$$

(c) Prove that $P(C) = 0$, i.e., that the Cantor set has zero probability.
One thing that makes this problem interesting is that unlike most simple
examples of nonempty events with zero probability, the Cantor set has
an uncountable infinity of points and is not a discrete set of points. This
can be shown by first showing that a point $x \in C$ if and only if the point
can be expressed as a ternary number $x = \sum_{n=1}^{\infty} a_n 3^{-n}$ where all the
$a_n$ are either 0 or 2. Thus the number of points in the Cantor set is
the same as the number of real numbers that can be expressed in this
fashion, which is the same as the number of real numbers that can be
expressed in a binary expansion (since each $a_n$ can have only two values),
which is the same as the number of points in the unit interval, which is
uncountably infinite.

20. Six people sit at a circular table and pass around and roll a single fair die
(equally probable to have any face 1 through 6 showing) beginning with
person 1. The game continues until the first 6 is rolled; the person
who rolled it wins the game. What is the probability that player 2
wins?

21. Show that given (2.22) through (2.24), (2.28) or (2.29) implies (2.25).
Thus (2.25), (2.28), and (2.29) provide equivalent candidates for the
fourth axiom of probability.

22. Suppose that $P$ is a probability measure on the real line and define the
sets $F_n = (0, 1/n)$ for all positive integers $n$. Evaluate $\lim_{n \to \infty} P(F_n)$.

23. Answer true or false for each of the following statements. Answers must 
be justified. 

(a) The following is a valid probability measure on the sample space
$\Omega = \{1, 2, 3, 4, 5, 6\}$ with event space $\mathcal{F}$ = all subsets of $\Omega$:

p(f) = ^ E a11 F e r. 

ieF 
(b) The following is a valid probability measure on the sample space
$\Omega = \{1, 2, 3, 4, 5, 6\}$ with event space $\mathcal{F}$ = all subsets of $\Omega$:

$$P(F) = \begin{cases} 1 & \text{if } 2 \in F \text{ or } 6 \in F \\ 0 & \text{otherwise} \end{cases}$$

(c) If $P(G \cup F) = P(F) + P(G)$, then $F$ and $G$ are independent.

(d) $P(F|G) > P(G)$ for all events $F$ and $G$.

(e) Mutually exclusive (disjoint) events with nonzero probability cannot
be independent.

(f) For any finite collection of events Fi\ i = 1,2, * * * , AT 

N 

i— 1 

24. Prove or provide a counterexample for the relation $P(F|G) + P(F|G^c) =
P(F)$.

25. Find the mean, second moment, and variance of a uniform pdf on an 
interval [a, b). 

26. Given a sample space $\Omega = \{0, 1, 2, \ldots\}$ define

p 0) = ^; k = o,i,2,- •• 

(a) What must $\gamma$ be in order for $p(k)$ to be a pmf?

(b) Find the probabilities $P(\{0, 2, 4, 6, \ldots\})$, $P(\{1, 3, 5, 7, \ldots\})$, and
$P(\{0, 1, 2, 3, 4, \ldots, 20\})$.

(c) Suppose that $K$ is a fixed integer. Find $P(\{0, K, 2K, 3K, \ldots\})$.

(d) Find the mean, second moment, and variance of this pmf. 

27. Suppose that $p(k)$ is a geometric pmf. Define $q(k) = (p(k) + p(-k))/2$.
Show that this is a pmf and find its mean and variance. Find the
probability of the sets $\{k : |k| > K\}$ and $\{k : k \text{ is a multiple of } 3\}$. Find the
probability of the sets $\{k : k \text{ is odd}\}$.

28. Define a pmf $p(k) = C \lambda^{|k|} / |k|!$ for $k \in Z$, $\lambda > 0$. Evaluate the constant
$C$ and find the mean and variance of this pmf.

29. A probability space consists of a sample space $\Omega$ = all pairs of positive
integers (that is, $\Omega = \{1, 2, 3, \ldots\}^2$) and a probability measure $P$ described
by the pmf $p$ defined by

$$p(k, m) = p^2 (1 - p)^{k+m-2}.$$

(a) Find $P(\{(k, m) : k > m\})$.

(b) Find the probability $P(\{(k, m) : k + m = r\})$ as a function of $r$ for
$r = 2, 3, \ldots$ Show that the result is a pmf.

(c) Find the probability $P(\{(k, m) : k \text{ is an odd number}\})$.

(d) Define the event $F = \{(k, m) : k > m\}$. Find the conditional pmf
$p_F(k, m) = P(\{k, m\}|F)$. Is this a product pmf?

30. The probability that Riddley Walker goes for a run in the morning before 

work is |. If he runs then the probability that he catches the train to 

work is If he does not run then the probability that he catches the 

train to work is If he does not catch the train, then he catches the 

bus. The model holds Monday through Friday. 

(a) What is the probability that Riddley gets the train any morning? 

(b) You are told that Riddley made the train - what is the probability 
that he ran? 

(c) What is the probability that Riddley catches the train exactly twice 
this week (out of the 5 working days)? 

(d) What is the expected number of times during the week that Riddley
will catch the train? You can leave your answer in terms of a sum.

31. You roll a six-sided die until either a 2 shows or an odd number shows. 
What is the probability of rolling a 2 before rolling an odd number? 

32. Rita and Ravi are starting a company. Together they must raise at least 
$100,000. Each raises money with a uniform distribution between $0 and 
$100,000. (Assume that money is continuous and that this is a uniform 
pdf - it is easier that way.) If either of them individually raises more than 
$75,000 they have to fill out extra IRS forms. What is the probability 
that they raise enough money but neither has to fill out extra forms? 

33. The probability that a man has a particular disease is Aj. John is tested 
for the disease but the test is not totally accurate. The probability that 
a person with the disease tests negative is Ay while the probability that 
a person who does not have the disease tests positive is Aj. John’s test 
returns positive. 

(a) Find the probability that John has the disease. 

(b) You are now told that this disease is hereditary. The probability that 
a son suffers from the disease if his father does is | , the probability 
that a son is infected with the disease even though his father is 
not is Ar. What is the probability that Max has the disease given 
that his son Peter has the disease? (Note: You may assume that 
the disease only affects males so you can ignore the dependence on 
Peter’s mother’s health.) 

(c) Michael is also tested but he worries about the accuracy of the test 
so he takes the test 10 times. One of the ten tests turns out positive,
the other nine negative. What is the probability that Michael has 
the disease? 

34. Define the uniform probability density function on $[0, 1)$ in the usual way
as

$$f(r) = \begin{cases} 1 & 0 \le r < 1 \\ 0 & \text{otherwise.} \end{cases}$$

(a) Define the set $F = \{0.25, 0.75\}$, a set with only two points. What is
the value of

$$\int_F f(r)\,dr?$$

The Riemann integral is well defined for a finite collection of points
and this should be easy. What is $\int_{F^c} f(r)\,dr$?

(b) Now define the set $F$ as the collection of all rational numbers in $[0, 1)$,
that is, all numbers that can be expressed as $k/n$ for some integers
$0 \le k < n$. What is the integral $\int_F f(r)\,dr$? Is it defined? Thinking
intuitively, what should it be? Suppose instead you consider the set
$F^c$, the set of all irrational numbers in $[0, 1)$. What is $\int_{F^c} f(r)\,dr$?

35. Given the uniform pdf on $[0, 1]$, $f(x) = 1$; $x \in [0, 1]$, find an expression for
$P((a, b))$ for all real $b > a$. Define the cumulative distribution function or
cdf $F$ as the probability of the event $\{x : x \le r\}$ as a function of $r \in \Re$:

$$F(r) = P((-\infty, r]) = \int_{-\infty}^{r} f(x)\,dx. \qquad (2.102)$$

Find the cdf for the uniform pdf. Find the probability of the event

$$G = \left\{\omega : \omega \in \left[\frac{1}{2^k}, \frac{1}{2^k} + \frac{1}{2^{k+1}}\right] \text{ for some even } k\right\} = \bigcup_{k \text{ even}} \left[\frac{1}{2^k}, \frac{1}{2^k} + \frac{1}{2^{k+1}}\right].$$

36. *Let $\Omega$ be a unit square $\{(x, y) : (x, y) \in \Re^2, -1/2 < x < 1/2, -1/2 <
y < 1/2\}$ and let $\mathcal{F}$ be the corresponding product Borel field. Is the circle
$\{(x, y) : (x^2 + y^2)^{1/2} < 1/2\}$ in $\mathcal{F}$? (Give a plausibility argument.) If
so, find the probability of this event if one assumes a uniform density
function on the unit square.

37. Given a pdf $f$, find the cumulative distribution function or cdf $F$ defined
as in (2.102) for the exponential, Laplacian, and Gaussian pdf’s. In the
Gaussian case, express the cdf in terms of the $\Phi$ function. Prove that if
$a \ge b$, then $F(a) \ge F(b)$. What is $dF(r)/dr$?

38. Let $\Omega = \Re^2$ and suppose we have a pdf $f(x, y)$ such that

$$f(x, y) = \begin{cases} C & \text{if } x > 0,\ y > 0,\ x + y < 1 \\ 0 & \text{otherwise.} \end{cases}$$

Find the probability $P(\{(x, y) : 2x > y\})$. Find the probability
$P(\{(x, y) : x < a\})$ for all real $a$. Is $f$ a product pdf?

39. Prove that the product $k$-dimensional pdf integrates to 1 over $\Re^k$.

40. Given the one-dimensional exponential pdf, find $P(\{x : x > r\})$ and the
cumulative distribution function $P(\{x : x \le r\})$ for $r \in \Re$.

41. Given the $k$-dimensional product doubly exponential pdf, find the
probabilities of the following events in $\Re^k$: $\{x : x_0 > 0\}$, $\{x : x_i > 0, \text{ all } i =
0, 1, \ldots, k-1\}$, $\{x : x_0 > x_1\}$.

42. Let $(\Omega, \mathcal{F}) = (\Re, \mathcal{B}(\Re))$. Let $P_1$ be the probability measure on this space
induced by a geometric pmf with parameter $p$ and let $P_2$ be the
probability measure induced on this space by an exponential pdf with parameter
$\lambda$. Form the mixture measure $P = P_1/2 + P_2/2$. Find $P(\{\omega : \omega > r\})$
for all $r \in [0, \infty)$.

43. Let $\Omega = \Re^2$ and suppose we have a pdf $f(x, y)$ such that

$$f(x, y) = C e^{-(1/2\sigma^2) x^2} e^{-\lambda y}; \quad x \in (-\infty, \infty),\ y \in [0, \infty).$$

Find the constant $C$. Is $f$ a product pdf? Find the probability
$\Pr(\{(x, y) : \sqrt{|x|} < a\})$ for all possible values of a parameter $a$. Find
the probability $\Pr(\{(x, y) : x^2 < y\})$.

44. Define $g(x)$ by

$$g(x) = \begin{cases} \lambda e^{-\lambda x} & x \in [0, \infty) \\ 0 & \text{otherwise.} \end{cases}$$

Let $\Omega = \Re^2$ and suppose we have a pdf $f(x, y)$ such that

$$f(x, y) = C g(x) g(y - x).$$

Find the constant $C$. Find an expression for the probability $P(\{(x, y) :
y < a\})$ as a function of the parameter $a$. Is $f$ a product pdf?

45. Let $\Omega = \Re^2$ and suppose we have a pdf such that

$$f(x, y) = \begin{cases} C|x| & -1 < x < 1;\ -1 < y < x \\ 0 & \text{otherwise.} \end{cases}$$

Find the constant $C$. Is $f$ a product pdf?

46. Suppose that a probability space has sample space $\Re^n$, the $n$-dimensional
Euclidean space. (This is a product space.) Suppose that a
multidimensional pdf $f$ is defined on this space by

$$f(x) = \begin{cases} C; & \max_i |x_i| < 1/2 \\ 0; & \text{otherwise} \end{cases}$$

that is, $f(x) = C$ when $-1/2 < x_i < 1/2$ for $i = 0, 1, \ldots, n-1$ and is 0
otherwise.

(a) What is $C$?

(b) Is $f$ a product pdf?

(c) What is $P(\{x : \min_i x_i \ge 0\})$, that is, the probability that the
smallest coordinate value is nonnegative?

Suppose next that we have a pdf $g$ defined by

$$g(x) = \begin{cases} K; & \|x\| < 1 \\ 0; & \text{otherwise,} \end{cases}$$

where

$$\|x\| = \sqrt{\sum_{i=0}^{n-1} x_i^2}$$

is the Euclidean norm of the vector $x$. Thus $g$ equals $K$ inside an
$n$-dimensional sphere of radius 1 centered at the origin.

(d) What is the constant $K$? (You may need to go to a book of integral
tables to find this.)

(e) Is this density a product pdf?

47. Let $(\Omega, \mathcal{F}, P)$ be a probability space and consider events $F$, $G$, and $H$
for which $P(F) > P(G) > P(H) > 0$. Events $F$ and $G$ form a partition
of $\Omega$, and events $F$ and $H$ are independent. Can events $G$ and $H$ be
disjoint?

48. (Courtesy of Prof. T. Cover) Suppose that the evidence of an event $F$
increases the likelihood of a criminal’s guilt; that is, if $G$ is the event that
the criminal is guilty, then $P(G|F) > P(G)$. The prosecutor discovers
that the event $F$ did not occur. What do you now know about the
criminal’s guilt? Prove your answer.




3 

Random Variables, Vectors, and Processes 



3.1 Introduction 

This chapter provides theoretical foundations and examples of ran- 
dom variables, vectors, and processes. All three concepts are varia- 
tions on a single theme and may be included in the general term of 
random object. We will deal specifically with random variables first 
because they are the simplest conceptually — they can be considered 
to be special cases of the other two concepts. 



3.1.1 Random Variables 

The name random variable suggests a variable that takes on values
randomly. In a loose, intuitive way this is the right interpretation;
e.g., an observer who is measuring the amount of noise on a
communication link sees a random variable in this sense. We require,
however, a more precise mathematical definition for analytical purposes.
Mathematically a random variable is neither random nor a variable:
it is just a function mapping one sample space into another space.
The first space is the sample space portion of a probability space,
and the second space is a subset of the real line (some authors would
call this a “real-valued” random variable). The careful
mathematical definition will place a constraint on the function to ensure that
the theory makes sense, but for the moment we informally define a
random variable as a function.

A random variable is perhaps best thought of as a measurement
on a probability space; that is, for each sample point $\omega$ the random
variable produces some value, denoted functionally as $f(\omega)$. One
can view $\omega$ as the result of some experiment and $f(\omega)$ as the result
of a measurement made on the experiment, as in the example of
the simple binary quantizer introduced in the introduction to
chapter 2. The experiment outcome $\omega$ is from an abstract space, e.g., real
numbers, integers, ASCII characters, waveforms, sequences, Chinese
characters, etc. The resulting value of the measurement or random
variable $f(\omega)$, however, must be “concrete” in the sense of being a
real number, e.g., a meter reading. The randomness is all in the
original probability space and not in the random variable; that is,
once the $\omega$ is selected in a “random” way, the output value or sample
value of the random variable is determined.

Alternatively, the original point $\omega$ can be viewed as an “input
signal” and the random variable $f$ can be viewed as “signal processing,”
i.e., the input signal $\omega$ is converted into an “output signal” $f(\omega)$ by
the random variable. This viewpoint becomes both precise and
relevant when we choose our original sample space to be a signal space
and we generalize random variables by random vectors and processes.

Before proceeding to the formal definition of random variables, 
vectors, and processes, we motivate several of the basic ideas by 
simple examples, beginning with random variables constructed on 
the fair wheel experiment of the introduction to chapter 2. 



A Coin Flip 

We have already encountered an example of a random variable in 
the introduction to chapter 2, where we defined a random variable 
q on the spinning wheel experiment which produced an output with 
the same pmf as a uniform coin flip. We begin by summarizing 
the idea with some slight notational changes and then consider the 
implications in additional detail. 

Begin with a probability space $(\Omega, \mathcal{F}, P)$ where $\Omega = \Re$ and the
probability $P$ is defined by (2.2) using the uniform pdf on $[0, 1)$ of
(2.4). Define the function $Y : \Re \to \{0, 1\}$ by

$$Y(r) = \begin{cases} 0 & \text{if } r \le 0.5 \\ 1 & \text{otherwise.} \end{cases} \tag{3.1}$$



When Tyche performs the experiment of spinning the pointer, we
do not actually observe the pointer, but only the resulting binary
value of $Y$. $Y$ can be thought of as signal processing or as a
measurement on the original experiment. Subject to a technical constraint
to be introduced later, any function defined on the sample space of
an experiment is called a random variable. The “randomness” of a
random variable is “inherited” from the underlying experiment and
in theory the probability measure describing its outputs should be
derivable from the initial probability space and the structure of the
function. To avoid confusion with the probability measure $P$ of the
original experiment, we refer to the probability measure associated
with outcomes of $Y$ as $P_Y$. $P_Y$ is called the distribution of the
random variable $Y$. The probability $P_Y(F)$ can be defined in a natural
way as the probability computed using $P$ of all the original samples
that are mapped by $Y$ into the subset $F$:

$$P_Y(F) = P(\{r : Y(r) \in F\}). \tag{3.2}$$

In this simple discrete example $P_Y$ is naturally defined for any subset
$F$ of $\Omega_Y = \{0, 1\}$, but in preparation for more complicated examples
we assume that $P_Y$ is to be defined for all suitably defined events,
that is, for $F \in \mathcal{B}_Y$, where $\mathcal{B}_Y$ is an event space consisting of
subsets of $\Omega_Y$. The probability measure for the output sample space can
be computed from the probability measure for the input using the
formula (3.2), which will shortly be generalized. This idea of deriving
new probabilistic descriptions for the outputs of some operation on an
experiment which produces inputs to the operation is fundamental to
the theories of probability, random processes, and signal processing.







For example, in our simple example (3.2) implies that

$$\begin{aligned}
P_Y(\{0\}) &= P(\{r : Y(r) = 0\}) = P(\{r : 0 \le r \le 0.5\}) \\
&= P([0, 0.5]) = 0.5 \\
P_Y(\{1\}) &= P((0.5, 1.0]) = 0.5 \\
P_Y(\Omega_Y) &= P_Y(\{0, 1\}) = P(\Re) = 1 \\
P_Y(\emptyset) &= P(\emptyset) = 0,
\end{aligned}$$

so that every output event can be assigned a probability by $P_Y$ by
computing the probability of the corresponding input event under
the input probability measure $P$.
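As a rough illustration of (3.2), the following Python sketch computes the derived distribution exactly by adding up the lengths of the intervals of $[0, 1)$ that the quantizer maps into an output event. The names `Y`, `INVERSE_IMAGE`, and `P_Y` are chosen here for illustration and do not come from the text, and the use of exact rational arithmetic is an assumption made only to avoid rounding.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def Y(r):
    """Binary quantizer of (3.1): 0 if r <= 0.5, 1 otherwise."""
    return 0 if r <= HALF else 1

# Inverse image of each output value, restricted to [0, 1) where the
# uniform pdf lives, stored as interval endpoints (a, b). Single
# endpoints have probability zero, so open vs. closed does not matter.
INVERSE_IMAGE = {0: (Fraction(0), HALF), 1: (HALF, Fraction(1))}

def P_Y(F):
    """P_Y(F) = P(Y^{-1}(F)): total length of the intervals mapped into F."""
    return sum((b - a for y, (a, b) in INVERSE_IMAGE.items() if y in F),
               Fraction(0))

# Sanity check: the midpoint of each interval really maps to its label.
assert all(Y((a + b) / 2) == y for y, (a, b) in INVERSE_IMAGE.items())

for event in (set(), {0}, {1}, {0, 1}):
    print(sorted(event), P_Y(event))
```

The table of inverse images is written out by hand here; for a more complicated quantizer one would derive it from the function rather than assume it.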

Eq. (3.2) can be written in a convenient compact manner by means
of the definition of the inverse image of a set $F$ under a mapping
$Y : \Omega \to \Omega_Y$:

$$Y^{-1}(F) = \{r : Y(r) \in F\}. \tag{3.3}$$

With this notation (3.2) becomes

$$P_Y(F) = P(Y^{-1}(F)); \quad F \subset \Omega_Y; \tag{3.4}$$

that is, the inverse image of a given set (output) under a mapping is
the collection of all points in the original space (input points) which
map into the given (output) set. This result is sometimes called
the fundamental derived distribution formula or the inverse image
formula. It will be seen in a variety of forms throughout the book.
When dealing with random variables it is common to interpret the
probability $P_Y(F)$ as “the probability that the random variable $Y$
takes on a value in $F$” or “the probability that the event $Y \in F$
occurs.” These English statements are often abbreviated to the form
$\Pr(Y \in F)$.

The probability measure $P_Y$ can be computed by summing a pmf,
which we denote $p_Y$. In particular, if we define

$$p_Y(y) = P_Y(\{y\}); \quad y \in \Omega_Y, \tag{3.5}$$

then additivity implies that

$$P_Y(F) = \sum_{y \in F} p_Y(y); \quad F \in \mathcal{B}_Y. \tag{3.6}$$

Thus the pmf describing a random variable can be computed as a
special case of the inverse image formula (3.5), and then used to
compute the probability of any event.
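The summation in (3.6) can be sketched in a few lines of Python. This is only an illustration under the assumption that the event space is the power set of $\{0, 1\}$; the helper names `prob` and `power_set` are hypothetical.

```python
from fractions import Fraction
from itertools import chain, combinations

p_Y = {0: Fraction(1, 2), 1: Fraction(1, 2)}  # pmf of the fair coin flip

def prob(pmf, F):
    """Equation (3.6): P_Y(F) is the sum of p_Y(y) over the points y in F."""
    return sum((pmf[y] for y in F), Fraction(0))

def power_set(points):
    """All subsets of a finite set, i.e. the event space assumed here."""
    pts = list(points)
    subsets = chain.from_iterable(
        combinations(pts, k) for k in range(len(pts) + 1))
    return [frozenset(s) for s in subsets]

# The whole distribution: one probability per event.
distribution = {F: prob(p_Y, F) for F in power_set(p_Y)}
for F, p in distribution.items():
    print(sorted(F), p)
```

Because the sample space is finite, listing the pmf on the two points is enough to tabulate the probability of every one of the four events.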

The indirect method provides a description of the fair coin flip in
terms of a random variable. The idea of a random variable can also
be applied to the direct description of a probability space. As in
the introduction to chapter 2, we describe directly a single coin flip
by choosing $\Omega = \{0, 1\}$ and assign a probability measure $P$ on this
space as in (2.12). Now define a random variable $V : \{0, 1\} \to \{0, 1\}$
on this space by

$$V(r) = r. \tag{3.7}$$

Here $V$ is trivial; it is just the identity mapping. The measurement
just puts out the outcome of the original experiment and the inverse
image formula trivially yields

$$\begin{aligned}
P_V(F) &= P(F) \\
p_V(v) &= p(v).
\end{aligned}$$



Note that this construction works on any probability space having the 
real line or a Borel subset thereof as a sample space. Thus for each 
of the named pmf’s and pdf’s there is a random variable associated 
with that pmf or pdf. 

If we have two random variables $V$ and $Y$ (which may be defined
on completely separate experiments as in the present case), we say
that they are equivalent or identically distributed if $P_V(F) = P_Y(F)$
for all events $F$, that is, the two probability measures agree exactly
on all events. It is easy to show with the inverse image formula that
$V$ is equivalent to $Y$ and hence that

$$p_Y(y) = p_V(y) = 0.5; \quad y = 0, 1. \tag{3.8}$$

Thus we have two equivalent random variables, either of which can
be used to model the single coin flip. Note that we do not say the
random variables are equal since they need not be. For example, you
could spin a pointer and find $Y$ and I could flip my own coin to find
$V$. The probabilities are the same, but the outcomes might or might
not differ.



3.1.2 Random Vectors 

The issue of the possible equality of two random variables raises
an interesting point. If you are told that $Y$ and $V$ are two
separate random variables with pmf’s $p_Y$ and $p_V$, then the question
of whether or not they are equivalent can be answered from these
pmf’s alone. If you wish to determine whether or not the two
random variables are in fact equal, however, then they must be
considered together or jointly. In the case where we have a random
variable $Y$ with outcomes in $\{0, 1\}$ and a random variable $V$ with
outcomes in $\{0, 1\}$, we could consider the two together as a single
random vector $(Y, V)$ with outcomes in the Cartesian product space
$\Omega_{YV} = \{0, 1\}^2 = \{(0, 0), (0, 1), (1, 0), (1, 1)\}$ with some pmf $p_{Y,V}$
describing the combined behavior

$$p_{Y,V}(y, v) = \Pr(Y = y, V = v) \tag{3.9}$$

so that

$$\Pr((Y, V) \in F) = \sum_{y,v:(y,v) \in F} p_{Y,V}(y, v); \quad F \in \mathcal{B}_{YV},$$

where in this simple discrete problem we take the event space $\mathcal{B}_{YV}$
to be the power set of $\Omega_{YV}$. Now the question of equality makes
sense as we can evaluate the probability that the two are equal:

$$\Pr(Y = V) = \sum_{y,v:y=v} p_{Y,V}(y, v).$$

If this probability is 1, then we know that the two random variables
are in fact equal with probability 1. Note that “equal with
probability 1” does not mean identically equal, since the two can
differ on a subset of $\Omega$ having probability zero.

A two-dimensional random vector $(Y, V)$ is simply two
random variables described on a common probability space.
Knowledge of the individual pmf’s $p_Y$ and $p_V$ alone is not sufficient in
general to determine $p_{Y,V}$. More information is needed. Either the
joint pmf must be given to us or we must be told the definitions of
the two random variables (two components of the two-dimensional
binary vector) so that the joint pmf can be derived. For example, if
we are told that the two random variables $Y$ and $V$ of our example
are in fact equal, then $\Pr(Y = V) = 1$ and $p_{Y,V}(y, v) = 0.5$ for $y = v$,
and 0 for $y \ne v$. This experiment can be thought of as flipping two
coins that are soldered together on the edge so that the result is two
heads or two tails.

To see an example of a radically different behavior, consider the
random variable $W : [0, 1) \to \{0, 1\}$ defined by

$$W(r) = \begin{cases} 0 & r \in [0.0, 0.25) \cup [0.5, 0.75) \\ 1 & \text{otherwise.} \end{cases} \tag{3.10}$$



It is easy to see that $W$ is equivalent to the random variables $Y$
and $V$ of this section, but $W$ and $Y$ are not equal even though they
are equivalent and defined on a common experiment. We can easily
derive the joint pmf for $W$ and $Y$ since the inverse image formula
extends immediately to random vectors. Now the events involve the
outputs of two random variables so some care is needed to keep the
notation from getting out of hand. As in the random variable case,
any probability measure on a discrete space can be expressed as a
sum over a pmf on points, that is,

$$P_{Y,W}(F) = \sum_{y,w:(y,w) \in F} p_{Y,W}(y, w), \tag{3.11}$$

where $F \subset \{0, 1\}^2$, and where

$$\begin{aligned}
p_{Y,W}(y, w) &= P_{Y,W}(\{(y, w)\}) \\
&= \Pr(Y = y, W = w); \quad y \in \{0, 1\},\ w \in \{0, 1\}.
\end{aligned} \tag{3.12}$$

As previously observed, pmf’s describing the joint behavior of
several random variables are called joint pmf’s and the corresponding
distribution is called a joint distribution. Thus finding the entire
distribution only requires finding the pmf, which can be done via the
inverse image formula. For example, if $(y, w) = (0, 0)$, then

$$\begin{aligned}
p_{Y,W}(0, 0) &= P(\{r : Y(r) = 0, W(r) = 0\}) \\
&= P([0, 0.5) \cap ([0.0, 0.25) \cup [0.5, 0.75))) \\
&= P([0, 0.25)) = 0.25.
\end{aligned}$$

Similarly it can be shown that

$$p_{Y,W}(0, 1) = p_{Y,W}(1, 0) = p_{Y,W}(1, 1) = 0.25.$$
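The joint pmf of $Y$ and $W$ can also be checked mechanically. The sketch below assumes that both functions are constant on the interior of each quarter interval $[k/4, (k+1)/4)$, so that evaluating them at the four midpoints, each carrying probability 1/4 under the uniform pdf, recovers the joint pmf. The helper name `joint_pmf` and the `cells` parameter are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

def Y(r):
    """Quantizer of (3.1)."""
    return 0 if r <= Fraction(1, 2) else 1

def W(r):
    """Quantizer of (3.10): 0 on [0, 0.25) and [0.5, 0.75), 1 otherwise."""
    first = Fraction(0) <= r < Fraction(1, 4)
    third = Fraction(1, 2) <= r < Fraction(3, 4)
    return 0 if (first or third) else 1

def joint_pmf(f, g, cells=4):
    """Joint pmf of (f, g), assuming both are constant on the interior of
    each interval [k/cells, (k+1)/cells); each cell has probability 1/cells."""
    pmf = defaultdict(Fraction)
    for k in range(cells):
        midpoint = Fraction(2 * k + 1, 2 * cells)
        pmf[(f(midpoint), g(midpoint))] += Fraction(1, cells)
    return dict(pmf)

p_YW = joint_pmf(Y, W)
for pair in sorted(p_YW):
    print(pair, p_YW[pair])
```

The midpoint trick works only because the cell boundaries line up with the points where $Y$ and $W$ change value; a different pair of functions would need a finer partition.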

Joint and marginal pmf’s can both be computed from the
underlying distribution, but the marginals can also be found directly
from the joint pmf’s without reference to the underlying
distribution. For example, $p_Y(y_0)$ can be expressed as $P_{Y,W}(F)$ by choosing
$F = \{(y, w) : y = y_0\}$. Then the pmf formula for $P_{Y,W}$ can be used
to write

$$\begin{aligned}
p_Y(y_0) = P_{Y,W}(F) &= \sum_{y,w:(y,w) \in F} p_{Y,W}(y, w) \\
&= \sum_{w \in \Omega_W} p_{Y,W}(y_0, w).
\end{aligned} \tag{3.13}$$

Similarly

$$p_W(w_0) = \sum_{y \in \Omega_Y} p_{Y,W}(y, w_0). \tag{3.14}$$

This is an example of the consistency of probability: using different
pmf’s derived from a common experiment to compute the probability
of a single event must produce the same result, so the marginals must
agree with the joints. Consistency means that we can find marginals
by “summing out” joints without knowing the underlying experiment
on which the random variables are defined.
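To make the “summing out” operation concrete, the following sketch applies (3.13) and (3.14) to the two joint pmf’s met so far: the joint pmf of $(Y, W)$ and that of $(Y, V)$ when $Y = V$ with probability 1. The function name `marginals` is hypothetical, and the two tables are simply typed in from the values quoted in the text.

```python
from collections import defaultdict
from fractions import Fraction

QUARTER, HALF = Fraction(1, 4), Fraction(1, 2)

# Joint pmf of (Y, W), and of (Y, V) when Y = V with probability 1.
p_YW = {(y, w): QUARTER for y in (0, 1) for w in (0, 1)}
p_YV = {(0, 0): HALF, (0, 1): Fraction(0), (1, 0): Fraction(0), (1, 1): HALF}

def marginals(joint):
    """Sum out each coordinate of a two-dimensional joint pmf,
    as in (3.13) and (3.14)."""
    first, second = defaultdict(Fraction), defaultdict(Fraction)
    for (a, b), p in joint.items():
        first[a] += p
        second[b] += p
    return dict(first), dict(second)

print(marginals(p_YW))
print(marginals(p_YV))
```

Both joint pmf’s yield the same equiprobable marginals, which illustrates why the marginals alone cannot determine the joint pmf.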

This completes the derived distribution of the two random
variables $Y$ and $W$ (or the single random vector $(Y, W)$) defined on the
original uniform pdf experiment. For this particular example the
joint pmf and the marginal pmf’s have the interesting property

$$p_{Y,W}(y, w) = p_Y(y) p_W(w), \tag{3.15}$$

that is, the joint distribution is a product distribution. A product
distribution better models our intuitive feeling of experiments such
as flipping two fair coins and letting the outputs $Y$ and $W$ be 1
or 0 according to the coins landing heads or tails.

In both of these example cases the joint pmf had to be consistent
with the individual pmf’s $p_Y$ and $p_V$ (i.e., the marginal pmf’s) in the
sense of giving the same probabilities to events where both joint and
marginal probabilities make sense. In particular,

$$\begin{aligned}
p_Y(y) &= \Pr(Y = y) = \Pr(Y = y, V \in \{0, 1\}) \\
&= \sum_{v=0}^{1} p_{Y,V}(y, v),
\end{aligned}$$

an example of a consistency property.

The two examples just considered of a random vector $(Y, V)$ with
the property $\Pr(Y = V) = 1$ and the random vector $(Y, W)$ with the
property $p_{Y,W}(y, w) = p_Y(y) p_W(w)$ represent extreme cases of
two-dimensional random vectors. In the first case $Y = V$ and hence
being told, say, that $V = v$ also tells us that necessarily $Y = v$. Thus
$V$ depends on $Y$ in a particularly strong manner and the two
random variables can be considered to be extremely dependent. The
product distribution, on the other hand, can be interpreted as
implying that knowing the outcome of one of the random variables tells us
absolutely nothing about the other, as is the case when flipping
two fair coins. Two discrete random variables $Y$ and $W$ will be
defined to be independent if they have a product pmf, that is, if
$p_{Y,W}(y, w) = p_Y(y) p_W(w)$. Independence of random variables will be
shortly related to the idea of independence of events as introduced
in chapter 2, but for the moment simply observe that it can be
interpreted as meaning that knowing the outcome of one random variable
does not affect the probability distribution of the other. This is a
very special case of general joint pmf’s. It may be surprising that
two random variables defined on a common probability space can
be independent of one another, but this was ensured by the specific
construction of the two random variables $Y$ and $W$.
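The product-pmf definition of independence is easy to test numerically for small discrete examples. The following is a minimal sketch; the function name `is_product_pmf` is hypothetical, and the two joint pmf’s are the ones described in the text (the independent pair $(Y, W)$ and the “soldered coins” pair $(Y, V)$).

```python
from fractions import Fraction

def is_product_pmf(joint):
    """True if joint(a, b) = p_first(a) * p_second(b) for every pair (a, b),
    where the marginals are obtained by summing out the joint pmf."""
    firsts = {a for a, _ in joint}
    seconds = {b for _, b in joint}
    p1 = {a: sum((joint.get((a, b), 0) for b in seconds), Fraction(0))
          for a in firsts}
    p2 = {b: sum((joint.get((a, b), 0) for a in firsts), Fraction(0))
          for b in seconds}
    return all(joint.get((a, b), 0) == p1[a] * p2[b]
               for a in firsts for b in seconds)

p_YW = {(y, w): Fraction(1, 4) for y in (0, 1) for w in (0, 1)}
p_YV = {(0, 0): Fraction(1, 2), (1, 1): Fraction(1, 2)}  # Y = V with probability 1

print(is_product_pmf(p_YW))
print(is_product_pmf(p_YV))
```

Pairs missing from a table are treated as having probability zero, which is why the second pmf can be written with only two entries.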

Note that we have also defined a three-dimensional random
vector $(Y, V, W)$ because we have defined three random variables on a
common experiment. Hence you should be able to find the joint pmf
$p_{YVW}$ using the same ideas.

Note also that in addition to the indirect derivation of specific
examples of a two-dimensional random vector, a direct development
is possible. For example, let $\{0, 1\}^2$ be a sample space with all of its
four points having equal probability. Any point $r$ in the sample space
can be expressed as $r = (r_0, r_1)$, where $r_i \in \{0, 1\}$ for $i = 0, 1$. Define
the random variables $V : \{0, 1\}^2 \to \{0, 1\}$ and $U : \{0, 1\}^2 \to \{0, 1\}$ by
$V(r_0, r_1) = r_0$ and $U(r_0, r_1) = r_1$. You should convince yourself that

$$p_{Y,W}(y, w) = p_{V,U}(y, w); \quad y = 0, 1;\ w = 0, 1$$

and that $p_Y(y) = p_W(y) = p_V(y) = p_U(y)$, $y = 0, 1$. Thus the
random vectors $(Y, W)$ and $(V, U)$ are equivalent.

In a similar manner pdf’s can be used to describe continuous ran- 
dom vectors, but we shall postpone this step until a later section and 
instead move to the idea of random processes. 



3.1.3 Random Processes 

It is straightforward conceptually to go from one random variable to
$k$ random variables constituting a $k$-dimensional random vector. It is
perhaps a greater leap to extend the idea to a random process. The
idea is at least easy to state, but it will take more work to provide
examples and the mathematical details will be more complicated. A
random process is a sequence of random variables $\{X_n;\ n = 0, 1, \ldots\}$
defined on a common experiment. It can be thought of as an
infinite-dimensional random vector. To be more accurate, this is an
example of a discrete-time, one-sided random process. It is called
“discrete-time” because the index $n$ which corresponds to time takes
on discrete values (here the nonnegative integers) and it is called
“one-sided” because only nonnegative times are allowed. A
discrete-time random process is also called a time series in the statistics
literature and is often denoted as $\{X(n);\ n = 0, 1, \ldots\}$. Sometimes it
is denoted by $\{X[n]\}$ in the digital signal processing literature. Two
questions might occur to the reader: how does one construct an
infinite family of random variables on a single experiment? How can one
provide a direct development of a random process as accomplished
for random variables and vectors? The direct development might
appear hopeless since infinite-dimensional vectors are involved, but
it is not.

The first question is reasonably easy to handle by example.
Consider the usual uniform pdf experiment. Rename the random
variables $Y$ and $W$ as $X_0$ and $X_1$, respectively. Consider the
following definition of an infinite family of random variables
$X_n : [0, 1) \to \{0, 1\}$ for $n = 0, 1, \ldots$. Every $r \in [0, 1)$ can be expanded as a binary
expansion of the form

$$r = \sum_{n=0}^{\infty} b_n(r) 2^{-n-1}. \tag{3.16}$$

This simply replaces the usual decimal representation by a binary
representation. For example, 1/4 is .25 in decimal and .01 or
.010000... in binary, 1/2 is .5 in decimal and yields the binary
sequence .1000..., 3/4 is .75 in decimal and .11000... in binary,
and 1/3 is .3333... in decimal and .010101... in binary.

Define the random process by $X_n(r) = b_n(r)$, that is, the $n$th term
in the binary expansion of $r$. When $n = 0, 1$ this reduces to the
specific $X_0$ and $X_1$ already considered.
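The binary expansion used to define the process can be computed by repeated doubling. The sketch below is one way to do it, assuming exact rational inputs so that no rounding occurs; the function name `binary_digits` is hypothetical.

```python
from fractions import Fraction

def binary_digits(r, n):
    """Return [b_0(r), ..., b_{n-1}(r)], the first n binary digits of
    r in [0, 1), so that r = sum_k b_k(r) 2^{-k-1} as in (3.16)."""
    digits = []
    for _ in range(n):
        r *= 2                      # shift the binary point one place right
        bit = 1 if r >= 1 else 0    # the digit that crossed the binary point
        digits.append(bit)
        r -= bit                    # keep only the fractional part
    return digits

for r in (Fraction(1, 4), Fraction(1, 2), Fraction(3, 4), Fraction(1, 3)):
    print(r, binary_digits(r, 6))
```

Passing a floating-point number instead of a `Fraction` also runs, but then the digits describe the nearest representable float rather than the intended real number.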

The inverse image formula can be used to compute
probabilities, although the calculations can get messy. Given the simple
two-dimensional example, however, the pmf’s for random vectors
$X^n = (X_0, X_1, \ldots, X_{n-1})$ can be evaluated as

$$p_{X^n}(x^n) = \Pr(X^n = x^n) = 2^{-n}; \quad x^n \in \{0, 1\}^n, \tag{3.17}$$

where $\{0, 1\}^n$ is the collection of all $2^n$ binary $n$-tuples. In other
words, the first $n$ binary digits in a binary expansion for a uniformly
distributed random variable are all equally probable. Note that in
this special case the joint pmf’s are again related to the marginal
pmf’s in a product fashion, that is,

$$p_{X^n}(x^n) = \prod_{i=0}^{n-1} p_{X_i}(x_i), \tag{3.18}$$

in which case the random variables $X_0, X_1, \ldots, X_{n-1}$ are said to be
mutually independent or, more simply, independent. If a random
process is such that any finite collection of the random variables
produced by the process are independent and the marginal pmf’s
are all the same (as in the case under consideration), the process is
said to be independent identically distributed or iid for short. An
iid process is also called a Bernoulli process, although the name is
sometimes reserved for a binary iid process.
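Equations (3.17) and (3.18) can be checked for small $n$ by brute force. The sketch below assumes that the first $n$ binary digits are constant on each dyadic cell $[k 2^{-n}, (k+1) 2^{-n})$, so that evaluating them at the cell midpoints, each of probability $2^{-n}$, gives the pmf exactly. The helper names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction
from math import prod

def binary_digits(r, n):
    """First n binary digits of r in [0, 1)."""
    digits = []
    for _ in range(n):
        r *= 2
        bit = 1 if r >= 1 else 0
        digits.append(bit)
        r -= bit
    return digits

def pmf_first_n(n):
    """pmf of X^n = (X_0, ..., X_{n-1}) under the uniform pdf on [0, 1)."""
    cells = 2 ** n
    pmf = defaultdict(Fraction)
    for k in range(cells):
        midpoint = Fraction(2 * k + 1, 2 * cells)
        pmf[tuple(binary_digits(midpoint, n))] += Fraction(1, cells)
    return dict(pmf)

def marginal(pmf, i):
    """Marginal pmf of coordinate i, obtained by summing out the rest."""
    out = defaultdict(Fraction)
    for x, p in pmf.items():
        out[x[i]] += p
    return dict(out)

n = 3
p_Xn = pmf_first_n(n)
marginal_pmfs = [marginal(p_Xn, i) for i in range(n)]
is_product = all(p == prod(marginal_pmfs[i][x[i]] for i in range(n))
                 for x, p in p_Xn.items())
print(len(p_Xn), set(p_Xn.values()), is_product)
```

A check for one value of $n$ is of course not a proof for all $n$; it only confirms that the formulas behave as claimed in a small case.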

Something fundamentally important has happened here. If we
have a random process, then the probability distribution for any
random vectors formed by collecting outputs of the random process
can be found (at least in theory) from the inverse image formula. The
calculations may be a mess, but at least in some cases such as this
one they are doable. Furthermore these pmf’s are consistent in the
sense noted before. In particular, if we use (3.13)-(3.14) to compute
the already computed pmf’s for $X_0$ and $X_1$ we get the same thing we
did before: they are each equiprobable binary random variables. If
we compute the joint pmf for $X_0$ and $X_1$ using (3.17) we also get the
same joint pmf we got before. This observation likely seems trivial
at this point (and it should be natural that the math does not give
any contradictions), but it emphasizes a property that is critically
important when trying to describe a random process in a more direct
fashion.

Suppose now that a more direct model of a random process is
desired without a complicated construction on an original
experiment. Here the problem is not as simple as in the random variable
or random vector case where all that was needed was a consistent
assignment of probabilities and an identity mapping. The solution is
known as the Kolmogorov extension theorem, named after A.N.
Kolmogorov, the primary developer of modern probability theory. The
theorem will be stated formally later in this chapter, but its
complicated proof will be left to other texts. The basic idea, however, can
be stated in a few words. If one can specify a consistent family of
pmf’s $p_{X^n}(x^n)$ for all $n$ (we have done this for $n = 1$ and 2), then
there exists a random process described by those pmf’s. Thus, for
example, there will exist a random process described by the family
of pmf’s $p_{X^n}(x^n) = 2^{-n}$ for $x^n \in \{0, 1\}^n$ for all positive integers $n$ if
and only if the family is consistent. We have already argued that
the family is indeed consistent, which means that even without the
indirect construction previously followed we can argue that there is
a well-defined random process described by these pmf’s. In
particular, one can think of a “grand experiment” where Nature selects a
one-sided binary sequence according to some mysterious probability
measure on sequences that we have difficulty envisioning. Nature
then reveals the chosen sequence to us one coordinate at a time,
producing the process $X_0, X_1, X_2, \ldots$. The distributions of any finite
collection of these random variables are known from the given pmf’s
$p_{X^n}$. Putting this in yet another way, describing or specifying the
finite-dimensional distributions of a process is enough to completely
describe the process (provided of course the given family of
distributions is consistent).
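The consistency requirement can be illustrated for the family $p_{X^n}(x^n) = 2^{-n}$ with a short check. The sketch below tests only one form of consistency, namely that summing out the last coordinate of the pmf for $n + 1$ coordinates returns the pmf for $n$ coordinates, and only up to a chosen `max_n`; the function names are hypothetical, and a finite check of this kind is an illustration rather than a proof.

```python
from fractions import Fraction
from itertools import product

def p(n):
    """Candidate family: p_{X^n}(x^n) = 2^{-n} for every binary n-tuple."""
    return {x: Fraction(1, 2 ** n) for x in product((0, 1), repeat=n)}

def sum_out_last(pmf):
    """Marginalize a pmf on (n+1)-tuples down to a pmf on n-tuples."""
    out = {}
    for x, prob in pmf.items():
        out[x[:-1]] = out.get(x[:-1], Fraction(0)) + prob
    return out

def is_consistent(max_n):
    """Check p(n) against the marginal of p(n+1) for n = 1, ..., max_n - 1."""
    return all(sum_out_last(p(n + 1)) == p(n) for n in range(1, max_n))

print(is_consistent(8))
```

For an iid family the check succeeds because each added coordinate contributes a factor that sums to one; for more general families the same test can fail, which is exactly the situation the consistency condition rules out.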

In this example the abstract probability measure on semi-infinite
binary sequences is not all that mysterious. From our construction
the sequence space can be considered to be essentially the same as
the unit interval (each point in the unit interval corresponding to
a binary sequence) and the probability measure is described by a
uniform pdf on this interval.

The second method of describing a random process is by far the 
most common in practice. One usually describes a process by its 
finite sample behavior and not by a construction on an abstract 
experiment. The Kolmogorov extension theorem ensures that this 
works. Consistency is easy to demonstrate for iid processes, but un- 
fortunately it becomes more difficult to verify in more general cases 
(and even more difficult to define and demonstrate for continuous 
time examples). 

Having toured the basic ideas to be explored in this chapter, we 
now delve into the details required to make the ideas precise and 
general. 
3.2 Random Variables

We now develop the precise definition of a random variable. As you
might guess, a technical condition for random variables is required
because of certain subtle pathological problems that have to do with
the ability to determine probabilities for the random variable. To
arrive at the precise definition, we start with the informal definition
of a random variable that we have already given and then show the
inevitable difficulty that results without the technical condition. We
have informally defined a random variable as being a function on a
sample space. Suppose we have a probability space $(\Omega, \mathcal{F}, P)$. Let
$f : \Omega \to \Re$ be a function mapping the sample space into the real line so that
$f$ is a candidate random variable. Since the selection of the original
sample point $\omega$ is random, that is, governed by a probability measure,
the output of our measurement or random variable $f(\omega)$ should also
be random. That is, we should be able to find the probability of
an “output event” such as the event “the outcome of the random
variable $f$ was between $a$ and $b$,” that is, the event $F \subset \Re$ given by
$F = (a, b)$. Observe that there are two different kinds of events being
considered here:

1. output events or members of the event space of the range or range space
of the random variable, that is, events consisting of subsets of possible
output values of the random variable; and

2. input events or $\Omega$ events, events in the original sample space of the
original probability space.

Can we find the probability of this output event? That is, can we
make mathematical sense out of the quantity “the probability that
$f$ assumes a value in an event $F \subset \Re$”? On reflection it seems clear
that we can. The probability that $f$ assumes a value in some set of
values must be the probability of all values in the original sample
space that result in a value of $f$ in the given set. We will make
this concept more precise shortly. To save writing we will abbreviate
such English statements to the form $\Pr(f \in F)$, or $\Pr(F)$, that is,
when the notation $\Pr(F)$ is encountered it should be interpreted as
shorthand for the English statement “the probability of an event $F$”
or “the probability that the event $F$ will occur” and not as a precise
mathematical quantity.

Recall from chapter 2 that for a subset $F$ of the real line $\Re$ to
be an event, it must be in a sigma-field or event space of subsets
of $\Re$. Recall also that we adopted the Borel field $\mathcal{B}(\Re)$ as our basic
event space for the real line. Hence it makes sense to require that
our output event $F$ be a Borel set.

We can now state the question as follows: Given a probability
space $(\Omega, \mathcal{F}, P)$ and a function $f : \Omega \to \Re$, is there a reasonable
and useful precise definition for the probability $\Pr(f \in F)$ for any
$F \in \mathcal{B}(\Re)$, the Borel field or event space of the real line? Since
the probability measure $P$ sits on the original measurable space
$(\Omega, \mathcal{F})$ and since $f$ assumes a value in $F$ if and only if $\omega \in \Omega$
is chosen so that $f(\omega) \in F$, the desired probability is obviously
$\Pr(f \in F) = P(\{\omega : f(\omega) \in F\}) = P(f^{-1}(F))$. In other words, the
probability that a random variable $f$ takes on a value in a Borel set $F$
is the probability (defined in the original probability space) of the set
of all (original) sample points $\omega$ that yield a value $f(\omega) \in F$. This,
in turn, is the probability of the inverse image of the Borel set $F$
under the random variable $f$. This idea of computing the probability
of an output event of a random variable using the original
probability measure of the corresponding inverse image of the output event
under the random variable is depicted in Figure 3.1.

Figure 3.1. The inverse image method: $\Pr(f \in F) = P(\{\omega : f(\omega) \in F\}) = P(f^{-1}(F))$

This natural definition of the probability of an output event of a
random variable indeed makes sense if and only if the probability
$P(f^{-1}(F))$ makes sense, that is, if the subset $f^{-1}(F)$ of $\Omega$
corresponding to the output event $F$ is itself an event, in this case an
input event or member of the event space $\mathcal{F}$ of the original sample
space. This, then, is the required technical condition: A function $f$
mapping the sample space of a probability space $(\Omega, \mathcal{F}, P)$ into the
real line $\Re$ is a random variable if and only if the inverse images of
all Borel sets in $\Re$ are members of $\mathcal{F}$, that is, if all of the $\Omega$ sets
corresponding to output events (members of $\mathcal{B}(\Re)$) are input events
(members of $\mathcal{F}$). Unlike some of the other pathological conditions
that we have met, it is easy to display some trivial examples where
the technical condition is not met (as we will see in Example [3.11]).
We now formalize the definition:

Given a probability space $(\Omega, \mathcal{F}, P)$, a (real-valued) random variable
is a function $f : \Omega \to \Re$ with the property that if $F \in \mathcal{B}(\Re)$, then also
$f^{-1}(F) = \{\omega : f(\omega) \in F\} \in \mathcal{F}$.

Given a random variable $f$ defined on a probability space $(\Omega, \mathcal{F}, P)$,
the set function

$$\begin{aligned}
P_f(F) &\triangleq P(f^{-1}(F)) = P(\{\omega : f(\omega) \in F\}) \\
&= \Pr(f \in F); \quad F \in \mathcal{B}(\Re)
\end{aligned} \tag{3.19}$$

is well defined since by definition $f^{-1}(F) \in \mathcal{F}$ for all $F \in \mathcal{B}(\Re)$. In
the next section the properties of distributions will be explored.

In some cases one may wish to consider a random variable with a
more limited range space than the real line, e.g., when the random
variable is binary. (Recall from appendix A that the range space of
$f$ is the image of $\Omega$.) If so, $\Re$ can be replaced in the definition by
the appropriate subset, say $A \subset \Re$. This is really just a question of
semantics since the two definitions are equivalent. One or the other
view may, however, be simpler to deal with for a particular problem.
A function meeting the condition in the definition we have given
is said to be measurable. This is because such functions inherit a
probability measure on their output events.
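On a finite sample space the measurability condition can be tested directly. The following sketch uses a hypothetical four-point sample space with a deliberately coarse event space, and checks whether every inverse image is an event; it is a toy illustration constructed here, not one of the numbered examples of the text.

```python
from itertools import chain, combinations

OMEGA = frozenset({1, 2, 3, 4})
# A small event space on OMEGA generated by the atoms {1, 2} and {3, 4}.
EVENTS = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), OMEGA}

def subsets(points):
    """All subsets of a finite collection of points."""
    pts = list(points)
    return [frozenset(s) for s in chain.from_iterable(
        combinations(pts, k) for k in range(len(pts) + 1))]

def is_random_variable(f, omega=OMEGA, events=EVENTS):
    """Check that every inverse image f^{-1}(F) is an event. On a finite
    sample space f takes finitely many values, so it suffices to test
    every subset F of the set of values taken by f."""
    values = {f(w) for w in omega}
    return all(frozenset(w for w in omega if f(w) in F) in events
               for F in subsets(values))

def coarse(w):
    """Constant on each atom of the event space."""
    return 0 if w <= 2 else 1

def identity(w):
    """Separates the points 1 and 2, which the event space cannot do."""
    return w

print(is_random_variable(coarse), is_random_variable(identity))
```

The identity map fails because the inverse image of the single value 1 is the set {1}, which is not a member of the chosen event space, so no probability can be assigned to it.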

If a random variable has a distribution described by a pmf or a 
pdf with a specific name, then the name is often applied also to the 
random variable; e.g., a continuous random variable with a Gaussian 
pdf is called a Gaussian random variable. 



Examples 

In every case we are given a probability space $(\Omega, \mathcal{F}, P)$. For the
moment, however, we will concentrate on the sample space $\Omega$ and
the random variable that is defined functionally on that space. Note
that the function must be defined for every value in the sample space
if it is to be a valid function. On the other hand, the function does
not have to assume every possible value in its range.

There is nothing particularly special about the names of the
random variables. So far we have used the lower case letter $f$. On
occasion we will use other lower case letters such as $g$ and $h$. As we
progress we will follow custom and more often use upper case letters
late in the alphabet, such as $X$, $Y$, $Z$, $U$, $V$, and $W$. Capital Greek
letters like O and T are also popular. 

The reader should keep the signal processing interpretation in 
mind while considering these examples. Several very common types 
of signal processing are considered, including quantization, sampling, 
and filtering. 

[3.1] Let $\Omega = \Re$, the real line, and define the random variable
$X : \Omega \to \Omega$ by $X(\omega) = \omega^2$ for all $\omega \in \Omega$. Thus the random variable
is the square of the sample point. Note that since the square
of a real number is always nonnegative, we could replace the
range $\Omega$ by the range space $[0, \infty)$ and consider $X$ as a mapping
$X : \Omega \to [0, \infty)$. Other random variables mapping $\Omega$ into itself
are $Y(\omega) = |\omega|$, $Z(\omega) = \sin(\omega)$, $U(\omega) = 3 \times \omega + 321.5$, and so on.
We can also consider the identity mapping as a random variable;
that is, we can define a random variable $W : \Omega \to \Omega$ by $W(\omega) = \omega$.

[3.2] Let $\Omega = \Re$ as in example [3.1] and define the random variable
$f : \Omega \to \{-V, V\}$ by

$$f(r) = \begin{cases} +V & \text{if } r \ge 0 \\ -V & \text{if } r < 0. \end{cases}$$

This example is a variation of the binary quantizer of a real input
considered in the introduction to chapter 2. With this specific choice
of output levels it is also called a hard limiter.

So far we have used $\omega$ exclusively to denote the argument of the
random variable. We can, however, use any letter to denote the
dummy variable (or argument or independent variable) of the
function, provided that we specify its domain; that is, we do not need to
use $\omega$ all the time to specify elements of $\Omega$; $r$, $x$, or any other dummy
variable will do. We will, however, as a convention, always use only
lower case letters to denote dummy variables.

When referring to a function, we will use several methods of
specification. Sometimes we will only give its name, say $f$; sometimes
we will specify its domain and range, as in $f : \Omega \to A$; sometimes we
will provide a specific dummy variable, as in $f(r)$; and sometimes we
will provide the dummy variable and its domain, as in $f(r);\ r \in \Omega$.
Finally, functions can be shown with a place for the dummy
variable marked by a period to avoid anointing any particular dummy
variable as being somehow special, as in $f(\cdot)$. These various
notations are really just different means of denoting the same thing while
emphasizing certain aspects of the functions. The only real danger
of this notation is the same as that of calculus and trigonometry:
if one encounters a function, say $\sin t$, does this mean the sine of a
particular $t$ (and hence a real number) or does it mean the entire
waveform of $\sin t$ for all $t$? The distinction should be clear from the
context, but the ambiguity can be removed, for example, by defining
something like $\sin t_0$ to mean a particular value and $\{\sin t;\ t \in \Re\}$ or
$\sin(\cdot)$ to mean the entire waveform.



[3.3] Let $U$ be as in example [3.1] and $f$ as in [3.2]. Then the
function $g : \Omega \to \Omega$ defined by $g(\omega) = f(U(\omega))$ is also a random
variable. This relation is often abbreviated by dropping the
explicit dependence on $\omega$ to write $g = f(U)$. More generally, any
function of a function is another function, called a “composite”
function. Thus a function of a random variable is another random
variable. Similarly, one can consider a random variable formed
by a complicated combination of other random variables, for
example, $g(\omega) = \frac{1}{\omega} \sinh^{-1}\left[\pi \times e^{\cos(|\omega|^{3.4})}\right]$.

[3.4] Let $\Omega = \Re^k$, $k$-dimensional Euclidean space. Occasionally
it is of interest to focus attention on the random variable
which is defined as a particular coordinate of a vector
$\omega = (x_0, x_1, \ldots, x_{k-1}) \in \Re^k$. Toward this end we can define for
each $i = 0, 1, \ldots, k-1$ a sampling function (or coordinate function)
$\Pi_i : \Re^k \to \Re$ as the following random variable:

$$\Pi_i(\omega) = \Pi_i((x_0, \ldots, x_{k-1})) = x_i .$$

The sampling functions are also called “projections” of the higher
dimensional space onto the lower. (This is the reason for the choice
of $\Pi$, Greek P, not to be confused with the product symbol $\prod$,
to denote the functions.)

Similarly, we can define a sampling function for any product space, 
e.g., for sequence and waveform spaces. 

★[3.5] Given a space $A$, an index set $\mathcal{T}$, and the product space
$A^{\mathcal{T}}$, define as a random variable, for any fixed $t \in \mathcal{T}$, the
sampling function $\Pi_t : A^{\mathcal{T}} \to A$ as follows: since any
$\omega \in A^{\mathcal{T}}$ is a vector or function of the form
$\{x_s; s \in \mathcal{T}\}$, define for each $t$ in $\mathcal{T}$ the mapping

$$\Pi_t(\omega) = \Pi_t(\{x_s; s \in \mathcal{T}\}) = x_t .$$

Thus, for example, if $\Omega$ is a one-sided binary sequence space

$$\prod_{i \in \mathcal{Z}_+} \{0, 1\}_i = \{0, 1\}^{\mathcal{Z}_+} ,$$

and hence every point has the form $\omega = (x_0, x_1, \ldots)$, then

$$\Pi_3((0, 1, 1, 0, 0, 0, 1, 0, 1, \ldots)) = 0 .$$

As another example, if for all $t$ in the index set, $\Re_t$ is a replica of
$\Re$ and $\Omega$ is the space

$$\Re^{\Re} = \prod_{t \in \Re} \Re_t$$

of all real-valued waveforms $\{x(t); t \in (-\infty, \infty)\}$, then for
$\omega = \{\sin t; t \in \Re\}$, the value of the sampling function at the
particular time $t = 2\pi$ is

$$\Pi_{2\pi}(\{\sin t; t \in \Re\}) = \sin 2\pi = 0 .$$

[3.6] Suppose that we have a one-sided binary sequence space
$\{0, 1\}^{\mathcal{Z}_+}$. For any $n \in \{1, 2, \ldots\}$, define the random
variable $Y_n$ by $Y_n(\omega) = Y_n((x_0, x_1, x_2, \ldots)) =$ the index (time)
of occurrence of the $n$th 1 in $\omega$. For example,
$Y_2((0, 0, 0, 1, 0, 1, 1, 0, 1, \ldots)) = 5$
because the second sample to be 1 is $x_5$.
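
As a rough illustration of examples [3.5] and [3.6], the following Python sketch evaluates a sampling function and the "time of the $n$th 1" random variable on a finite prefix of a binary sequence. The function names and the use of tuples to stand in for infinite sequences are assumptions made only for this illustration.

```python
def sampling_function(i):
    """Return Pi_i, the map taking a sequence omega to its coordinate x_i."""
    return lambda omega: omega[i]


def index_of_nth_one(omega, n):
    """Return Y_n(omega): the index at which the n-th 1 occurs.

    omega is a finite prefix of a binary sequence; None is returned
    if the prefix contains fewer than n ones.
    """
    count = 0
    for index, sample in enumerate(omega):
        if sample == 1:
            count += 1
            if count == n:
                return index
    return None


# The two sequences used in the text.
pi_3 = sampling_function(3)
print(pi_3((0, 1, 1, 0, 0, 0, 1, 0, 1)))                 # 0
print(index_of_nth_one((0, 0, 0, 1, 0, 1, 1, 0, 1), 2))  # 5
```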

[3.7] Say we have a one-sided sequence space
$\Omega = \prod_{i \in \mathcal{Z}_+} \Re_i$, where $\Re_i$ is a replica of the real
line for each $i$ in the index set. Since every $\omega$ in this space has
the form $\{x_0, x_1, \ldots\} = \{x_i; i \in \mathcal{Z}_+\}$, we can define for
each positive integer $n$ the random variable, depending on $n$,

$$S_n(\omega) = S_n(\{x_i; i \in \mathcal{Z}_+\}) = n^{-1} \sum_{i=0}^{n-1} x_i ,$$

the arithmetic average or “mean” of the first $n$ coordinates of the
infinite sequence.

For example, if $\omega = \{1, 1, 1, 1, 1, 1, 1, \ldots\}$, then $S_n = 1$ for
every $n$. This average is also called a Cesàro mean or sample average
or time average since the index being summed over often corresponds
to time; viz., we are adding the outputs at times 0 through $n - 1$ in
the preceding equation. Such arithmetic means will later be seen to
play a fundamental role in describing the long-term average behavior
of random processes. The arithmetic mean can also be written using
coordinate functions as

$$S_n(\omega) = n^{-1} \sum_{i=0}^{n-1} \Pi_i(\omega) , \qquad (3.20)$$

which we abbreviate to

$$S_n = n^{-1} \sum_{i=0}^{n-1} \Pi_i \qquad (3.21)$$

by suppressing the dummy variable or argument $\omega$. Equation (3.21)
is shorthand for (3.20) and says the same thing: the arithmetic
average of the first $n$ terms of a sequence is the sum of the first $n$
coordinates or samples of the sequence, divided by $n$.
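
The sample average written in terms of coordinate functions can be sketched in a few lines of Python. This is only an illustration under the assumption that a sequence is represented by a finite tuple holding at least its first $n$ samples; the names `coordinate` and `sample_average` are hypothetical.

```python
def coordinate(i):
    """Return the coordinate function Pi_i."""
    return lambda omega: omega[i]


def sample_average(omega, n):
    """S_n(omega) = (1/n) * sum of Pi_i(omega) for i = 0, ..., n-1."""
    return sum(coordinate(i)(omega) for i in range(n)) / n


all_ones = (1,) * 7
print(sample_average(all_ones, 5))      # 1.0
print(sample_average((0, 1, 0, 1), 4))  # 0.5
```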

[3.8] As a generalization of the sample average consider weighted
averages of sequences. Such weighted averages occur in the
convolutions of linear system theory. Let $\Omega$ be the space
$\prod_{i \in \mathcal{Z}} \Re_i$, where $\Re_i$ are all copies of the real line.
Suppose that $\{h_k; k = 0, 1, 2, \ldots\}$ is a fixed sequence of real
numbers that can be used to form a weighted average of the
coordinates of $\omega \in \Omega$. Each $\omega$ in this space has the form
$\omega = (\ldots, x_{-1}, x_0, x_1, \ldots) = \{x_i; i \in \mathcal{Z}\}$ and hence a
weighted average can be defined for each integer $n$:

$$Y_n(\omega) = \sum_{k=0}^{\infty} h_k x_{n-k} .$$

Thus the random variable $Y_n$ is formed as a linear combination
of the coordinates of the sequence constituting the point $\omega$ in the
double-sided sequence space. This is a discrete time convolution of an
input sequence with a linear weighting. In linear system theory the
weighting is called a unit pulse response (or Kronecker delta response
or $\delta$ response), and it is the discrete time equivalent of an impulse
response. Note that we could also use the sampling function notation
to write $Y_n$ as a weighted sum of the sample random variables.
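
A minimal Python sketch of the weighted average of example [3.8] follows. It assumes a unit pulse response with only finitely many nonzero terms (so the infinite sum is a finite one) and represents the two-sided sequence as a function of the integer index; both choices, and the names used, are assumptions for illustration only.

```python
def weighted_average(h, x, n):
    """Y_n = sum over k of h[k] * x(n - k) for a finite-length response h.

    h is a list of weights h_0, h_1, ...; x maps an integer index to a sample.
    """
    return sum(h_k * x(n - k) for k, h_k in enumerate(h))


def unit_step(i):
    """Input sequence: 0 for negative times, 1 from time 0 onward."""
    return 1.0 if i >= 0 else 0.0


h = [0.5, 0.25, 0.25]
print([weighted_average(h, unit_step, n) for n in range(-1, 3)])
# [0.0, 0.5, 0.75, 1.0]
```

Choosing all $n$ weights equal to $1/n$ turns the same function into a sliding version of the sample average $S_n$.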

[3.9] In a similar fashion, complicated random variables can be
defined on waveform spaces. For example, let
$\Omega = \prod_{t \in \Re} \Re_t$, the
space of all real-valued functions of time such as voltage-time
waveforms. For each $T$, define a time average

$$Y_T(\omega) = Y_T(\{x(t); t \in \Re\}) = T^{-1} \int_0^T x(t)\,dt ,$$

or given the impulse response $h(t)$ of a causal, linear
time-invariant system, we define a weighted average

$$W_T(\omega) = \int_0^{\infty} h(t) x(T - t)\,dt .$$

Are these also random variables? They are certainly functions de- 
fined on the underlying sample space, but as one might suspect, the 
sample space of all real-valued waveforms is quite large and contains
some bizarre waveforms. For example, the waveforms can be suffi- 
ciently pathological to preclude the existence of the integrals cited 
(see chapter 2 for a discussion of this point). These examples are 
sufficiently complicated to force us to look a bit closer at a proper 
definition of a random variable and to develop a technical condition 
that constrains the generality of our definition but ensures that the 
definition will lead to a useful theory. It should be pointed out, 
however, that this difficulty is no accident and is not easily solved: 
waveforms are truly more complicated than sequences because of the 
wider range of possible waveforms due to the uncountability of the 
time variable. Continuous time random processes are more difficult 
to deal with rigorously than are discrete time processes. One can 
write equations such as the integrals and then find that the inte- 
grals do not make sense even in the general Lebesgue sense. Often 
fairly advanced mathematics are required to properly patch up the 
problems. For purposes of simplicity we usually concentrate on se- 
quences (and hence on discrete time) rather than waveforms, and 
we gloss over the technical problems when we consider continuous 
time examples. In chapter 5 we will return to add some rigor to the 
continuous case using the idea of mean square convergence to define 
the integrals. 

One must know the event space being considered in order to deter- 
mine whether or not a function is a random variable. While we will 
virtually always assume the usual event spaces (that is, the power 
set for discrete spaces, the Borel field for the real line or subsets of 
the real line, and the corresponding product event spaces for product 
sample spaces), it is useful to consider some other examples to help 
clarify the basic definition.

[3.10] First consider $(\Omega, \mathcal{F}, P)$ where $\Omega$ is itself a discrete
subset of the real line $\Re$, e.g., $\{0, 1\}$ or $\mathcal{Z}_+$. If, as usual, we
take $\mathcal{F}$ to be the power set, then any function $f : \Omega \to \Re$ is a
random variable. This follows since the inverse image of any Borel
set in $\Re$ must be a subset of $\Omega$ and hence must be in the collection
of all subsets of $\Omega$.

Thus with the usual event space for a discrete sample space - 
the power set — any function defined on the probability space is a 
random variable. This is why all of the structure of event spaces and 
random variables is not seen in elementary texts that consider only 
discrete spaces: There is no need. 

It should be noted that for any $\Omega$, discrete or not, if $\mathcal{F}$ is the
power set, then all functions defined on $\Omega$ are random variables. This
fact is useful, however, only for discrete sample spaces since the power
set is not a useful event space in the continuous case (since we cannot
endow it with useful probability measures).

If, however, $\mathcal{F}$ is not the power set, some functions defined on $\Omega$
are not random variables, as the following simple example shows:

[3.11] Let $\Omega$ be arbitrary, but let $\mathcal{F}$ be the trivial sigma-field
$\{\Omega, \emptyset\}$. On this space it is easy to construct functions that are
not random variables (and hence are non-measurable functions). For
example, let $\Omega = \{0, 1\}$ and define $f(\omega) = \omega$, the identity function.
Then $f^{-1}(\{0\}) = \{0\}$ is not in $\mathcal{F}$, and hence this simple function is
not a random variable. In fact, it is obvious that any function that
assigns different values to 0 and 1 is not a random variable. Note,
however, that some functions are random variables: any constant
function qualifies, since its inverse images are always either
$\emptyset$ or $\Omega$.
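
For a finite sample space the measurability condition of examples [3.10] and [3.11] can be checked by brute force. The Python sketch below is only an illustration: it assumes a finite $\Omega$, so that it suffices to test the inverse image of every subset of the (finite) range, and all function names are hypothetical.

```python
from itertools import combinations


def power_set(s):
    """Return the set of all subsets of s, each as a frozenset."""
    items = list(s)
    return {frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)}


def inverse_image(f, omega, output_set):
    """Return f^{-1}(output_set) = {w in omega : f(w) in output_set}."""
    return frozenset(w for w in omega if f(w) in output_set)


def is_random_variable(f, omega, event_space):
    """Check that every inverse image of a subset of the range is an event."""
    range_space = {f(w) for w in omega}
    return all(inverse_image(f, omega, F) in event_space
               for F in power_set(range_space))


omega = {0, 1}
trivial = {frozenset(), frozenset(omega)}

print(is_random_variable(lambda w: w, omega, trivial))           # False
print(is_random_variable(lambda w: 7, omega, trivial))           # True
print(is_random_variable(lambda w: w, omega, power_set(omega)))  # True
```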

The problem illustrated by this example is that the input event 
space is not big enough or “fine” enough to contain all input sets 
corresponding to output events. This apparently trivial example 
suggests an important technique for dealing with advanced random 
process theory, especially for continuous time random processes: If 
the event space is not large enough to include the inverse image of 
all Borel sets, then enlarge the event space to include all such events, 
viz., by using the power set as in example [3.10]. Alternatively, we
might try to force $\mathcal{F}$ to contain all sets of the form $f^{-1}(F)$,
$F \in \mathcal{B}(\Re)$; that is, make $\mathcal{F}$ the sigma-field generated by such
sets. Further treatment of this subject is beyond the scope of the
book. However, it is
worth remembering that if a sigma-field is not big enough to make a 
function a random variable, it can often be enlarged to be big enough. 
This is not idle twiddling; such a procedure is required for important 
applications, e.g., to make integrals over time defined on a waveform 
space into random variables. 

On a more hopeful tack, if the probability space $(\Omega, \mathcal{F}, P)$ is
chosen with $\Omega = \Re$ and $\mathcal{F} = \mathcal{B}(\Re)$, then all functions $f$
normally encountered in the real world are in fact random variables.
For example, continuous functions, polynomials, step functions,
trigonometric functions,
limits of measurable functions, maxima and minima of measurable 
functions, and so on are random variables. It is, in fact, extremely 
difficult to construct functions on Borel spaces that are not ran- 
dom variables. The same statement holds for functions on sequence 
spaces. The difficulty is comparable to constructing a set on the real 
line that is not a Borel set and is beyond the scope of this book. 

So far we have considered abstract philosophical aspects in the 
definition of random variables. We are now ready to develop the 
probabilistic properties of the defined random variables. 



3.3 Distributions of Random Variables 
3.3.1 Distributions 

Suppose we have a probability space $(\Omega, \mathcal{F}, P)$ with a random
variable, $X$, defined on the space. The random variable $X$ takes values
on its range space which is some subset $A$ of $\Re$ (possibly $A = \Re$). The
range space $A$ of a random variable is often called the alphabet of the
random variable. As we have seen, since $X$ is a random variable, we
know that all subsets of $\Omega$ of the form
$X^{-1}(F) = \{\omega : X(\omega) \in F\}$, with $F \in \mathcal{B}(A)$, must be members of
$\mathcal{F}$ by definition. Thus the set function $P_X$ defined by

$$P_X(F) = P(X^{-1}(F)) = P(\{\omega : X(\omega) \in F\}) ; \quad F \in \mathcal{B}(A) \qquad (3.22)$$

is well defined and assigns probabilities to output events involving the
random variable in terms of the original probability of input events in
the original experiment. The three written forms in equation (3.22)
are all read as $\Pr(X \in F)$ or “the probability that the random
variable $X$ takes on a value in $F$.” Furthermore, since inverse images
preserve all set-theoretic operations (see problem A.12), $P_X$ satisfies
the axioms of probability as a probability measure on $(A, \mathcal{B}(A))$:
it is nonnegative, $P_X(A) = 1$, and it is countably additive. Thus $P_X$
is a probability measure on the measurable space $(A, \mathcal{B}(A))$.
Therefore, given a probability space and a random variable $X$, we have
constructed a new probability space $(A, \mathcal{B}(A), P_X)$ where the events
describe outcomes of the random variable. The probability measure
$P_X$ is called the distribution of $X$ (as opposed to a “cumulative
distribution function” of $X$ to be introduced later).
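
The inverse image formula (3.22) is easy to exercise on a finite probability space. The sketch below is an illustration only: it assumes the original measure is given as a dictionary mapping each sample point to its probability, and the fair-die example and all names are assumptions rather than anything taken from the text.

```python
from fractions import Fraction


def distribution(P, X):
    """Return P_X as a function of output sets F.

    P maps each sample point to its probability; X is the random variable.
    P_X(F) = P({omega : X(omega) in F}), as in (3.22).
    """
    def P_X(F):
        return sum((prob for omega, prob in P.items() if X(omega) in F),
                   Fraction(0))
    return P_X


# Hypothetical example: a fair die and the random variable X(omega) = omega mod 3.
die = {omega: Fraction(1, 6) for omega in range(1, 7)}
P_X = distribution(die, lambda omega: omega % 3)

print(P_X({0}))        # 1/3
print(P_X({0, 1}))     # 2/3
print(P_X({0, 1, 2}))  # 1
```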

If two random variables have the same distribution, then they are 
said to be equivalent since they have the same probabilistic descrip- 
tion, whether or not they are defined on the same underlying space 
or have the same functional form (see problem 3.30). 

A substantial part of the application of probability theory to prac- 
tical problems is devoted to determining the distributions of random 
variables, that is, performing the “calculus of probability.” One be- 
gins with a probability space. A random variable is defined on that 
space. The distribution of the random variable is then derived, and 
this results in a new probability space. This topic is called variously 
“derived distributions” or “transformations of random variables” and 
is often developed in the literature as a sequence of apparently unre- 
lated subjects. When the points in the original sample space can be 
interpreted as “signals,” then such problems can be viewed as “sig- 
nal processing” and derived distribution problems are fundamental to 
the analysis of statistical signal processing systems. We shall empha- 
size that all such examples are just applications of the basic inverse 
image formula (3.22) and form a unified whole. In fact, this formula, 
with its vector analog, is one of the most important applications of 
probability theory. Its specialization to discrete input spaces using 
sums and to continuous input spaces using integrals will be seen and 
used often throughout this book. 

It is useful to bear in mind both the mathematical and the intuitive
concepts of a random variable when studying them. Mathematically,
a random variable, say $X$, is a “nice” (= measurable) real-valued
function defined on the sample space of a probability space
$(\Omega, \mathcal{F}, P)$. Intuitively, a random variable is something that takes on
values at random. The randomness is described by a distribution
$P_X$, that is, by a probability measure on an event space of the real
line. When doing computations involving random variables, it is
usually simpler to concentrate on the probability space
$(A, \mathcal{B}(A), P_X)$, where $A$ is the range space of $X$, than on the original
probability space $(\Omega, \mathcal{F}, P)$. Many experiments can yield equivalent
random variables, and the space $(A, \mathcal{B}(A), P_X)$ can be considered as
a canonical description of the random variable that is often more
useful for computation.
The original space is important, however, for two reasons. First, all 
distribution properties of random variables are inherited from the 
original space. Therefore much of the theory of random variables is 
just the theory of probability spaces specialized to the case of real 
sample spaces. If we understand probability spaces in general, then 
we understand random variables in particular. Second, and more 
important, we will often have many interrelated random variables 
defined on a common probability space. Because of the interrela- 
tionships, we cannot consider the random variables independently 
with separate probability spaces and distributions. We must refer 
to the original space in order to study the dependencies among the 
various random variables (or to consider the random variables jointly 
as a random vector). 

Since a distribution is a special case of a probability measure, in
many cases it can be induced or described by a probability function,
i.e., a pmf or a pdf. If a range space of the random variable is discrete
or, more generally, if there is a discrete subset of the range space $A$
such that $P_X(A) = 1$, then there is a pmf, say $p_X$, corresponding to
the distribution $P_X$. The two are related via the formulas

$$p_X(x) = P_X(\{x\}) , \quad \text{all } x \in A , \qquad (3.23)$$

where $A$ is the range space or alphabet of the random variable, and

$$P_X(F) = \sum_{x \in F} p_X(x) ; \quad F \in \mathcal{B}(A) . \qquad (3.24)$$

In (3.23) both quantities are read as $\Pr(X = x)$.

The pmf and the distribution imply each other from (3.23) and
(3.24), and hence either formula specifies the random variable.

If the range space of the random variable is continuous and if a
pdf $f_X$ exists, then we can write the integral analog to (3.24):

$$P_X(F) = \int_F f_X(x)\,dx ; \quad F \in \mathcal{B}(A) . \qquad (3.25)$$

There is no direct analog of (3.23) for a pdf since a pdf is not a
probability. An approximate analog of (3.23) follows from the mean
value theorem of calculus. Suppose that $F = [x, x + \Delta x)$, where
$\Delta x$ is extremely small. Then if $f_X$ is sufficiently smooth, the mean
value theorem implies that

$$P_X([x, x + \Delta x)) = \int_x^{x + \Delta x} f_X(\alpha)\,d\alpha \approx f_X(x) \Delta x , \qquad (3.26)$$

so that if we multiply a pdf $f_X(x)$ by a differential $\Delta x$, it can be
interpreted as (approximately) the probability of being within $\Delta x$
of $x$. It is desirable, however, to have an exact pair of results like
(3.23) and (3.24) that show how to go both ways, that is, to get
the probability function from the distribution as well as vice versa.
From considerations of elementary calculus it seems that we should
somehow differentiate both sides of (3.25) to yield the pdf in terms of
the distribution. This is not immediately possible, however, because
$F$ is a set and not a real variable. Instead, to find a pdf from a
distribution, we use the intermediary of a cumulative distribution
function or cdf. We pause to give the formal definition.
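
The approximation (3.26) can be checked numerically for any pdf whose interval probabilities are known in closed form. The following Python sketch assumes, purely for illustration, an exponential pdf $f_X(x) = \lambda e^{-\lambda x}$ for $x \ge 0$ with a hypothetical rate $\lambda = 2$; neither the pdf nor the numbers come from the text.

```python
import math

lam = 2.0  # assumed rate parameter of the illustrative exponential pdf


def pdf(x):
    """Exponential pdf f_X(x) = lam * exp(-lam * x), x >= 0."""
    return lam * math.exp(-lam * x)


def interval_probability(x, dx):
    """Exact P_X([x, x + dx)) obtained by integrating the pdf."""
    return math.exp(-lam * x) - math.exp(-lam * (x + dx))


x, dx = 0.5, 1e-4
exact = interval_probability(x, dx)
approx = pdf(x) * dx  # right-hand side of (3.26)
relative_error = abs(exact - approx) / exact
print(exact, approx, relative_error)
```

For this choice the relative error is on the order of $\lambda \Delta x / 2$, so it shrinks in proportion to $\Delta x$.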

Given a random variable $X$ with distribution $P_X$, the cumulative
distribution function or cdf $F_X$ is defined by

$$F_X(\alpha) = P_X((-\infty, \alpha]) = P_X(\{x : x \le \alpha\}) ; \quad \alpha \in \Re .$$

The cdf is seen to represent the cumulative probability of all values
of the random variable in the infinite interval from minus infinity up
to and including the real number argument of the cdf. The various
forms can be summarized as $F_X(\alpha) = \Pr(X \le \alpha)$. If the random
variable $X$ is defined on the probability space $(\Omega, \mathcal{F}, P)$, then by
definition

$$F_X(\alpha) = P(X^{-1}((-\infty, \alpha])) = P(\{\omega : X(\omega) \le \alpha\}) .$$

If a distribution possesses a pdf, then the cdf and pdf are related
through the distribution and (3.25) by

$$F_X(\alpha) = \int_{-\infty}^{\alpha} f_X(x)\,dx ; \quad \alpha \in \Re . \qquad (3.27)$$

The motivation for the definition of the cdf in terms of our previous 
discussion is now obvious. Since integration and differentiation are 
mutually inverse operations, the pdf is determined from the cdf (and 
hence the distribution) by 



$$f_X(\alpha) = \frac{dF_X(\alpha)}{d\alpha} , \qquad (3.28)$$

where, as is customary, the right-hand side is shorthand for

$$\left. \frac{dF_X(x)}{dx} \right|_{x = \alpha} ,$$

the derivative evaluated at $\alpha$. Alternatively, (3.28) also follows from
the fundamental theorem of calculus and the observation that

$$P_X((a, b]) = \int_a^b f_X(x)\,dx = F_X(b) - F_X(a) . \qquad (3.29)$$

Thus (3.27) and (3.28) together show how to find a pdf from a
distribution and hence provide the continuous analog of (3.23).
Equation (3.28) is useful, however, only if the derivative, and hence
the pdf, exists. Observe that the cdf is always well defined (because the
semi-infinite interval is a Borel set and therefore an event), regardless
of whether or not the pdf exists. This is true in both the continuous
and the discrete alphabet cases. For example, if $X$ is a discrete
alphabet random variable with alphabet $\mathcal{Z}$ and pmf $p_X$, then the cdf
is

$$F_X(x) = \sum_{k=-\infty}^{x} p_X(k) , \qquad (3.30)$$

the analogous sum to the integral of (3.27). Furthermore, for this
example, the pmf can be determined from the cdf (as well as the
distribution) as

$$p_X(x) = F_X(x) - F_X(x - 1) , \qquad (3.31)$$

a difference analogous to the derivative of (3.28).
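
The pair (3.30) and (3.31) amounts to a running sum and a first difference. A short Python sketch, assuming for illustration a pmf supported on the integers 0 through 4 (a binomial pmf with four trials and success probability one half, chosen only as an example):

```python
from fractions import Fraction
from itertools import accumulate

# Assumed example pmf on the alphabet {0, 1, 2, 3, 4}.
pmf = [Fraction(1, 16), Fraction(4, 16), Fraction(6, 16),
       Fraction(4, 16), Fraction(1, 16)]

# (3.30): the cdf at x is the sum of the pmf over all k <= x.
cdf = list(accumulate(pmf))

# (3.31): the pmf is recovered as F_X(x) - F_X(x - 1); the cdf is 0
# below the smallest alphabet point, so the first term is cdf[0] itself.
recovered = [cdf[0]] + [cdf[k] - cdf[k - 1] for k in range(1, len(cdf))]

print(cdf[-1])           # 1
print(recovered == pmf)  # True
```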

It is desirable to use a single notation for the discrete and
continuous cases whenever possible. This is accomplished for expressing
the distribution in terms of the probability functions by using a
Stieltjes integral, which is defined as

$$P_X(F) = \int_F dF_X(x) = \int 1_F(x)\,dF_X(x) \triangleq \begin{cases} \sum_{x \in F} p_X(x) & \text{if } X \text{ is discrete} \\ \int_F f_X(x)\,dx & \text{if } X \text{ has a pdf.} \end{cases} \qquad (3.32)$$

Thus (3.32) is a combination of both (3.24) and (3.25).




3.3.2 Mixture Distributions 

More generally, we may have a random variable that has both discrete
and continuous aspects and hence is not describable by either a pmf
alone or a pdf alone. For example, we might have a probability space
$(\Re, \mathcal{B}(\Re), P)$, where $P$ is described by a Gaussian pdf
$f(\omega); \omega \in \Re$. The sample point $\omega \in \Re$ is input to a soft limiter
with output $X(\omega)$, a device with input/output characteristic $X$
defined by

$$X(\omega) = \begin{cases} -1 & \omega \le -1 \\ \omega & \omega \in (-1, 1) \\ 1 & 1 \le \omega . \end{cases} \qquad (3.33)$$

As long as $|\omega| \le 1$, $X(\omega) = \omega$. But for values outside this range,
the output is set equal to $-1$ or $+1$. Thus all of the probability
density outside the limiting range “piles up” on the ends so that
$\Pr(X(\omega) = 1) = \int_{\omega \ge 1} f(\omega)\,d\omega$ is not zero. As a result $X$ will
have a mixture distribution, described by a pdf in $(-1, 1)$ and by a
pmf at the points $\pm 1$.

Random variables of this type can be described by a distribution
that is the weighted sum of two other distributions: a discrete
distribution and a continuous distribution. The weighted sum is an
example of a mixture distribution, that is, a mixture of probability
measures as in example [2.18]. Specifically, let $P_1$ be a discrete
distribution with corresponding pmf $p$, and let $P_2$ be a continuous
distribution described by a pdf $f$. For any positive weights $c_1, c_2$
with $c_1 + c_2 = 1$, the following mixture distribution $P_X$ is defined
for all $F \in \mathcal{B}(\Re)$:

$$\begin{aligned} P_X(F) &= c_1 P_1(F) + c_2 P_2(F) = c_1 \sum_{k \in F} p(k) + c_2 \int_F f(x)\,dx \\ &= c_1 \sum 1_F(k) p(k) + c_2 \int 1_F(x) f(x)\,dx . \end{aligned} \qquad (3.34)$$



Continuing the example, the output of the limiter of (3.33) has a pmf
and a pdf. For a Gaussian input pdf that is symmetric about 0, the pmf
places probability one half on each of the points $\pm 1$, while
the pdf is Gaussian-shaped for magnitudes less than unity (i.e., it is
a truncated Gaussian pdf normalized so that the pdf integrates to
one over the range $(-1, 1)$). The constant $c_2$ is the integral of the
input Gaussian pdf over $(-1, 1)$ and $c_1 = 1 - c_2$. Observe that the
cdf for a random variable with a mixture distribution is

$$F_X(\alpha) = c_1 \sum_{k : k \le \alpha} p(k) + c_2 \int_{-\infty}^{\alpha} f(x)\,dx = c_1 F_1(\alpha) + c_2 F_2(\alpha) , \qquad (3.35)$$

where $F_1$ and $F_2$ are the cdf’s corresponding to $P_1$ and $P_2$
respectively.
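
The mixture cdf (3.35) for the soft limiter can be evaluated numerically. The sketch below assumes, only for illustration, a zero-mean, unit-variance Gaussian input; the text does not fix these parameters, and the function names are hypothetical. It builds $F_1$ (the cdf of the pmf on $\pm 1$), $F_2$ (the cdf of the truncated Gaussian), and the weights $c_1, c_2$, and then combines them.

```python
import math


def gaussian_cdf(a):
    """cdf of the assumed zero-mean, unit-variance Gaussian input."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))


# c2 is the input probability of (-1, 1); c1 is the mass piled up on the ends.
c2 = gaussian_cdf(1.0) - gaussian_cdf(-1.0)
c1 = 1.0 - c2


def discrete_cdf(a):
    """F_1: cdf of the pmf placing probability 1/2 on each of -1 and +1."""
    if a < -1:
        return 0.0
    return 0.5 if a < 1 else 1.0


def continuous_cdf(a):
    """F_2: cdf of the Gaussian truncated to (-1, 1) and renormalized."""
    if a <= -1:
        return 0.0
    if a >= 1:
        return 1.0
    return (gaussian_cdf(a) - gaussian_cdf(-1.0)) / c2


def limiter_cdf(a):
    """Mixture cdf F_X(a) = c1 * F_1(a) + c2 * F_2(a), as in (3.35)."""
    return c1 * discrete_cdf(a) + c2 * continuous_cdf(a)


print(round(c2, 4))                            # 0.6827
print(limiter_cdf(-1.0), gaussian_cdf(-1.0))   # jump at -1 equals Pr(input <= -1)
```

Inside $[-1, 1)$ the mixture cdf agrees with the input Gaussian cdf, as it should, since there the event $X \le \alpha$ is the same as the event that the input is at most $\alpha$.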

The combined notation for discrete and continuous alphabets using
the Stieltjes integral notation of (3.32) also can be used as follows.
Given a random variable with a mixture distribution of the form
(3.34), then

$$P_X(F) = \int_F dF_X(x) = \int 1_F(x)\,dF_X(x) ; \quad F \in \mathcal{B}(\Re) , \qquad (3.36)$$

where

$$\int 1_F(x)\,dF_X(x) = c_1 \sum 1_F(x) p(x) + c_2 \int 1_F(x) f(x)\,dx . \qquad (3.37)$$

Observe that (3.36) and (3.37) include (3.32) as a special case where
either $c_1$ or $c_2$ is 0. Equations (3.36) and (3.37) provide a general
means for finding the distribution of a random variable $X$ given its
cdf, provided the distribution has the form of (3.35).

All random variables can be described by a cdf. But, more subtly, 
do all random variables have a cdf of the form (3.35)? The answer is 
almost yes. Certainly all of the random variables encountered in this 
course and in engineering practice have this form. It can be shown, 
however, that the most general cdf has the form of a mixture of 
three cdf’s: a continuous and differentiable piece induced by a pdf, 
a discrete piece induced by a pmf, and a third pathological piece. 
The third piece is an odd beast wherein the cdf is something called a 
singular function — the cdf is continuous (it has no jumps as it does 
in the discrete case), and the cdf is differentiable almost everywhere 
(here “almost everywhere” means that the cdf is differentiable at all 
points except some set $F$ for which $\int_F dx = 0$), but this derivative
is 0 almost everywhere and hence it cannot be integrated to find 
a probability! Thus for this third piece, one cannot use pmf’s or 
pdf’s to compute probabilities. The construction of such a cdf is 
beyond the scope of this text, but we can point out for the curious 
that the typical example involves placing probability measures on 
the Cantor set that was considered in problem 2.19. At any rate, as 
such examples almost never arise in practice, we shall ignore them 
and henceforth consider only random variables for which (3.36) and 
(3.37) hold. 

While the general mixture distribution random variable has both 
discrete and continuous pieces, for pedagogical purposes it is usually 
simplest to treat the two pieces separately - i.e., to consider random 
variables that have either a pdf or a pmf. Hence we will rarely con- 
sider mixture distribution random variables and will almost always 
focus on those that are described either by a pmf or by a pdf and 
not both. 

To summarize our discussion, we will define a random variable to
be a discrete, continuous, or mixture random variable depending on
whether it is described probabilistically by a pmf, pdf, or mixture as
in (3.36) and (3.37) with $c_1, c_2 > 0$.

We note in passing that some texts endeavor to use a uniform ap- 
proach to mixture distributions by permitting pdf’s to possess Dirac 
delta or impulse functions. The purpose of this approach is to permit 
the use of the continuous ideas in discrete cases, as in our limiter out- 
put example. If the cdf is differentiated, then a legitimate pdf results 
(without the need for a pmf) if a delta function is allowed at the two 
discontinuities of the cdf. As a general practice we prefer the Stielt- 
jes notation, however, because of the added notational clumsiness 
resulting from using pdf’s to handle inherently discrete problems. 
For example, compare the notation for the geometric pmf with the 
corresponding pdf that is written using Dirac delta functions. 



3.3.3 Derived Distributions 

[3.12] Let $(\Omega, \mathcal{F}, P)$ be a discrete probability space with $\Omega$ a
discrete subset of the real line and $\mathcal{F}$ the power set. Let $p$ be the
pmf corresponding to $P$, that is,

$$p(\omega) = P(\{\omega\}) , \quad \text{all } \omega \in \Omega .$$

(Note: There is a very subtle possibility for confusion here. $p(\omega)$
could be considered to be a random variable because it satisfies
the definition for a random variable. We do not use it in this
sense, however; we use it as a pmf for evaluating probabilities
in the context given. In addition, no confusion should result
because we rarely use lower case letters for random variables.)
Let $X$ be a random variable defined on this space. Since the
domain of $X$ is discrete, its range space, $A$, is also discrete (refer
to the definition of a function to understand this point). Thus the
probability measure $P_X$ must also correspond to a pmf, say $p_X$;
that is, (3.23) and (3.24) must hold. Then we can derive either
the distribution $P_X$ or the simpler pmf $p_X$ in order to complete







a probabilistic description of $X$. Using (3.22) yields

$$p_X(x) = P_X(\{x\}) = P(X^{-1}(\{x\})) = \sum_{\omega : X(\omega) = x} p(\omega) . \qquad (3.38)$$

Equation (3.38) provides a formula for computing the pmf and
hence the distribution of any random variable defined on a discrete
probability space. As a specific example, consider a discrete
probability space $(\Omega, \mathcal{F}, P)$ with $\Omega = \mathcal{Z}_+$, $\mathcal{F}$ the power set of
$\Omega$, and $P$ the probability measure induced by the geometric pmf.
Define a random variable $Y$ on this space by

$$Y(\omega) = \begin{cases} 1 & \text{if } \omega \text{ even} \\ 0 & \text{if } \omega \text{ odd,} \end{cases}$$

where we consider 0 (which has probability zero under the geometric
pmf) to be even. Thus we have a random variable
$Y : \mathcal{Z}_+ \to \{0, 1\}$. Using the formula (3.38) for the pmf for
$Y(\omega) = 1$ results in

$$\begin{aligned} p_Y(1) &= \sum_{\omega : \omega \text{ even}} P(\{\omega\}) = \sum_{k = 2, 4, \ldots} (1-p)^{k-1} p \\ &= \frac{p}{1-p} \sum_{k=1}^{\infty} \left((1-p)^2\right)^k = p(1-p) \sum_{k=0}^{\infty} \left((1-p)^2\right)^k \\ &= \frac{p(1-p)}{1 - (1-p)^2} = \frac{1-p}{2-p} , \end{aligned}$$

where we have used the standard geometric series summation formula
(in a thinly disguised variation of an example of section 2.2.4). We
can calculate the remaining point in the pmf from the axioms of
probability: $p_Y(0) = 1 - p_Y(1)$. Thus we have found a non-obvious
derived distribution by computing a pmf via (3.38), a special case
of (3.22). Of course, given the pmf, we could now calculate the
distribution from (3.24) for all four sets in the power set of $\{0, 1\}$.
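
The closed form $p_Y(1) = (1-p)/(2-p)$ can be sanity-checked by summing the geometric pmf directly over the even integers, which is just (3.38) applied term by term. The Python sketch below truncates the infinite sum at an assumed number of terms; the function names and the truncation length are choices made only for this illustration.

```python
def p_even_by_summation(p, terms=10_000):
    """Sum the geometric pmf (1-p)**(k-1) * p over k = 2, 4, ..., 2*terms."""
    return sum((1 - p) ** (k - 1) * p for k in range(2, 2 * terms + 1, 2))


def p_even_closed_form(p):
    """Closed form (1 - p) / (2 - p) derived in the text."""
    return (1 - p) / (2 - p)


for p in (0.1, 0.3, 0.5, 0.9):
    print(p, p_even_by_summation(p), p_even_closed_form(p))
```

The two columns agree to within floating-point rounding for these values of $p$; for $p$ very close to 0 the truncation would need more terms.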

[3.13] Say we have a probability space $(\Re, \mathcal{B}(\Re), P)$ where $P$ is
described by a pdf $g$; that is, $g$ is a nonnegative function of the
real line with total integral 1 and

$$P(F) = \int_{r \in F} g(r)\,dr ; \quad F \in \mathcal{B}(\Re) .$$

Suppose that we have a random variable $X : \Re \to \Re$. We can use
(3.22) to write a general formula for the distribution of $X$:

$$P_X(F) = P(X^{-1}(F)) = \int_{r : X(r) \in F} g(r)\,dr .$$

Ideally, however, we would like to have a simpler description of $X$.
In particular, if $X$ is a “reasonable function” it should have either a
discrete range space (e.g., a quantizer) or a continuous range space
(or possibly both, as in the general mixture case). If the range space
is discrete, then $X$ can be described by a pmf, and the preceding
formula (with the requisite change of dummy variable) becomes

$$p_X(x) = P_X(\{x\}) = \int_{r : X(r) = x} g(r)\,dr .$$

If, however, the range space is continuous, then there should exist
a pdf for $X$, say $f_X$, such that (3.25) holds. How do we find this
pdf? As previously discussed, to find a pdf from a distribution, we
first find the cdf $F_X$. Then we differentiate the cdf with respect to
its argument to obtain the pdf. As a nontrivial example, suppose
that we have a probability space $(\Re, \mathcal{B}(\Re), P)$ with $P$ the
probability measure induced by the Gaussian pdf. Define a random
variable $W : \Re \to \Re$ by $W(r) = r^2; r \in \Re$. Following the described
procedure, we first attempt to find the cdf $F_W$ for $W$:

$$\begin{aligned} F_W(w) &= \Pr(W \le w) = P(\{\omega : W(\omega) = \omega^2 \le w\}) \\ &= P([-w^{1/2}, w^{1/2}]) ; \quad \text{if } w \ge 0 . \end{aligned}$$

The cdf is clearly 0 if $w < 0$. Since $P$ is described by a pdf, say $g$
(the specific Gaussian form is not yet important), then

$$F_W(w) = \int_{-w^{1/2}}^{w^{1/2}} g(r)\,dr .$$

If one should now try to plug in the specific form for the Gaussian
density, one would quickly discover that no closed form solution
exists. Happily, however, the integral does not have to be evaluated
explicitly; we need only its derivative. Therefore we can use the
following handy formula from elementary calculus for differentiating
the integral:

$$\frac{d}{dw} \int_{a(w)}^{b(w)} g(r)\,dr = g(b(w)) \frac{db(w)}{dw} - g(a(w)) \frac{da(w)}{dw} . \qquad (3.39)$$

Application of the formula yields

$$f_W(w) = g(w^{1/2}) \left( \frac{w^{-1/2}}{2} \right) - g(-w^{1/2}) \left( -\frac{w^{-1/2}}{2} \right) . \qquad (3.40)$$

The final answer is found by plugging in the Gaussian form of $g$. For
simplicity we do this only for the special case where $m = 0$. Then $g$
is symmetric; that is, $g(w) = g(-w)$, so that

$$f_W(w) = w^{-1/2} g(w^{1/2}) ; \quad w \in [0, \infty) ,$$

and finally, with $\sigma^2$ denoting the variance of the Gaussian pdf,

$$f_W(w) = \frac{w^{-1/2}}{\sqrt{2\pi\sigma^2}} e^{-w/2\sigma^2} ; \quad w \in [0, \infty) .$$



This pdf is called a chi-squared pdf (with one degree of freedom). 
Observe that the functional form of the pdf is valid only for the given 
domain. By implication the pdf is zero outside the given domain;
in this example, negative values of $W$ cannot occur. One should
always specify the domain of the dummy variable of a pdf; otherwise 
the description is incomplete. 
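
As a numerical sanity check on the derivation, the following Python sketch compares the derived density $w^{-1/2} g(w^{1/2})$ with a finite-difference derivative of the cdf. It assumes a zero-mean Gaussian $g$ with $\sigma = 1$ by default; the function names and the step size `h` are illustrative choices, not part of the text.

```python
import math

def gaussian_pdf(r, sigma=1.0):
    """Zero-mean Gaussian pdf g(r) with standard deviation sigma (assumed form)."""
    return math.exp(-r * r / (2.0 * sigma * sigma)) / math.sqrt(2.0 * math.pi * sigma * sigma)

def cdf_w(w, sigma=1.0):
    """F_W(w) = P([-sqrt(w), sqrt(w)]) for W = r^2, expressed with the error function."""
    if w <= 0.0:
        return 0.0
    return math.erf(math.sqrt(w) / (sigma * math.sqrt(2.0)))

def pdf_w(w, sigma=1.0):
    """Derived pdf f_W(w) = w^(-1/2) g(w^(1/2)) for w > 0, and 0 otherwise."""
    if w <= 0.0:
        return 0.0
    return gaussian_pdf(math.sqrt(w), sigma) / math.sqrt(w)

def numerical_derivative(f, x, h=1e-5):
    """Central-difference approximation to df/dx."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# The two columns should agree closely at each test point.
for w in (0.25, 1.0, 4.0):
    print(w, pdf_w(w), numerical_derivative(cdf_w, w))
```

The closed form and the differentiated cdf agree to within the accuracy of the finite difference, and `pdf_w` returns zero for negative arguments, which is the domain restriction emphasized above.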

In practice one is likely to encounter the following trick for deriving 
densities for certain simple one-dimensional problems. The approach 
can be used whenever the random variable is a monotonic (increasing 
or decreasing) function of its argument. Suppose first that we have 
a random variable Y = g(X), where g is a monotonically increasing 
function and that g is differentiable. Since g is monotonic, it is 
invertible and we can write $X = g^{-1}(Y)$; that is, $x = g^{-1}(y)$ is the
value of $x$ for which $g(x) = y$. Then



$$ F_Y(y) = \Pr(g(X) \le y) = \Pr(X \le g^{-1}(y)) = F_X(g^{-1}(y)) = \int_{-\infty}^{g^{-1}(y)} f_X(x)\, dx . $$

From (3.39) the density can be found as

$$ f_Y(y) = \frac{d}{dy} F_Y(y) = f_X(g^{-1}(y)) \frac{dg^{-1}(y)}{dy} . $$

A similar result can be derived for a monotonically decreasing $g$
except that a minus sign results. The final formula is that if
$Y = g(X)$ and $g$ is monotone, then

$$ f_Y(y) = f_X(g^{-1}(y)) \left| \frac{dg^{-1}(y)}{dy} \right| . \tag{3.41} $$

This result is a one-dimensional special case of the so-called Jaco- 
bian approach to derived distributions. The result could be used to 
solve the previous problem by separately considering negative and
nonnegative values of the input $r$ since $r^2$ is a monotonic increasing
function for nonnegative $r$ and monotonic decreasing for negative $r$.
As in this example, the direct approach from the inverse image for- 
mula is often simpler than using the Jacobian “shortcut,” unless one 
is dealing with a monotonic function. 
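
A minimal Python sketch of formula (3.41) follows. It evaluates $f_X(g^{-1}(y))$ times the absolute slope of $g^{-1}$, with the slope approximated by a central difference. The two test cases (a Gaussian input through $e^x$ and an exponential input through $-2x$) are hypothetical illustrations chosen here, not examples from the text.

```python
import math

def derived_pdf(f_x, g_inv, y, h=1e-6):
    """Evaluate f_Y(y) = f_X(g_inv(y)) * |d g_inv(y) / dy| for a monotone g."""
    slope = (g_inv(y + h) - g_inv(y - h)) / (2.0 * h)
    return f_x(g_inv(y)) * abs(slope)

def std_gaussian_pdf(x):
    """Gaussian pdf with mean 0 and variance 1."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def exponential_pdf(x):
    """Exponential pdf with unit rate, zero for negative arguments."""
    return math.exp(-x) if x >= 0.0 else 0.0

# Increasing case: Y = exp(X), so g_inv(y) = ln(y) for y > 0.
f_y_increasing = derived_pdf(std_gaussian_pdf, math.log, 2.0)

# Decreasing case: Y = -2 X, so g_inv(y) = -y / 2 and the slope is negative.
f_y_decreasing = derived_pdf(exponential_pdf, lambda y: -y / 2.0, -3.0)

print(f_y_increasing, f_y_decreasing)
```

The absolute value is what lets one function serve both the increasing and the decreasing case; without it the decreasing example would return a negative "density".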

It can be seen that although the details may vary from application 
to application, all derived distribution problems are solved by the 
general formula (3.22). In some cases the solution will result in a 
pmf; in others the solution will result in a pdf. 

To review the general philosophy, one uses the inverse image formula
to compute the probability of an output event. This is accomplished
by finding the probability with respect to the original probability
measure of all input events that result in the given output event.
In the discrete case one concentrates on output events of the form
$X = x$ and thereby finds a pmf. In the continuous case, one
concentrates on output events of the form $X \le x$ and thereby finds
a cdf. The pdf is then found by differentiating.

[3.14] As a final example of derived distributions, suppose that
we are given a probability space $(\Omega, \mathcal{B}(\Omega), P)$ with
$\Omega \subset \Re$. Define the identity mapping $X : \Omega \to \Omega$ by
$X(\omega) = \omega$. The identity mapping on the real line with the
Borel field is always a random variable because the measurability
requirement is automatically satisfied. Obviously the distribution
$P_X$ is identical to the original probability measure $P$. Thus all
probability spaces with real sample spaces provide examples of
random variables through the identity mapping. A random variable
described in this form instead of as a general function (not the
identity mapping) on an underlying probability space is called a
“directly given” random variable.



3.4 Random Vectors and Random Processes 

Thus far we have emphasized random variables, scalar functions on 
a sample space that assume real values. In some cases we may wish 
to model processes or measurements with complex values. Complex 
outputs can be considered as two-dimensional real vectors with the 
components being the real and imaginary parts or, equivalently, the 
magnitude and phase. This special case can be equally well described 
as a single complex-valued random variable or as a two-dimensional
random vector. 

More generally, we may have $k$-dimensional real vector outputs.
A random variable is a real-valued function on a sample space (with
a technical condition), that is, a function mapping a sample space
into the real line $\Re$. The obvious random vector definition is a
vector-valued function definition. Under this definition, a random
vector is a vector of random variables, a function mapping the sample
space into $\Re^k$ instead of $\Re$. Yet even more generally, we may have vectors
that are not finite dimensional, e.g., sequences and waveforms whose 
values at each time are random variables. This is essentially the 
definition of a random process. Fundamentally speaking, both ran- 
dom vectors and random processes are simply collections of random 
variables defined on a common probability space. 

Given a probability space $(\Omega, \mathcal{F}, P)$, a finite collection of
random variables $\{X_i;\ i = 0, 1, \ldots, k-1\}$ is called a random vector.
We will often denote a random vector in boldface as $\mathbf{X}$. Thus a
random vector is a vector-valued function $\mathbf{X} : \Omega \to \Re^k$
defined by $\mathbf{X} = (X_0, X_1, \ldots, X_{k-1})$ with each of the
components being a random variable. It is also common to use an
ordinary non-boldface $X$ and let context indicate whether $X$ has
dimension 1 or not. Another common notation for the $k$-dimensional
random vector is $X^k$. Each of these forms is convenient in different
settings, but we begin with the boldface notation in order to
distinguish the random vectors from scalar random variables. As we
progress, however, the non-boldface notation will be used with
increasing frequency to match current style in the literature. The
boldface notation is still found, but it is far less common than it
used to be. When vectors are used in linear algebra manipulations
with matrices and other vectors, we will assume that they are column
vectors so that strictly speaking the vector should be denoted
$\mathbf{X} = (X_0, X_1, \ldots, X_{k-1})^t$, where $t$ denotes transpose.

A slightly different notation will ease the generalization to random
processes. A random vector $\mathbf{X} = (X_0, X_1, \ldots, X_{k-1})$ can be
defined as an indexed family of random variables $\{X_i;\ i \in \mathcal{T}\}$
where $\mathcal{T}$ is the index set $\mathcal{Z}_k = \{0, 1, \ldots, k-1\}$. The
index set in some examples will correspond to time; e.g., $X_i$ is a
measurement on an experiment at time $i$ for $k$ different times. We get
a random process by using the same basic definition with an infinite
index set, which almost always corresponds to time. A random process
or stochastic process is an indexed family of random variables
$\{X_t;\ t \in \mathcal{T}\}$ or, equivalently, $\{X(t);\ t \in \mathcal{T}\}$,
defined on a common probability space $(\Omega, \mathcal{F}, P)$. The process
is said to be discrete time if $\mathcal{T}$ is discrete, e.g.,
$\mathcal{Z}_+$ or $\mathcal{Z}$, and continuous time if the index set is
continuous, e.g., $\Re$ or $[0, \infty)$. A discrete time random process is
often called a time series. It is said to be discrete alphabet or
discrete amplitude if all finite-length random vectors of random
variables drawn from the random process are discrete random vectors.
The process is said to be continuous alphabet or continuous amplitude
if all finite-length random vectors of random variables drawn from the
random process are continuous random vectors. The process is said to
have a mixed alphabet if all finite-length random vectors of random
variables drawn from the random process are mixture random vectors.

Thus a random process is a collection of random variables indexed
by time, usually into the indefinite future and sometimes into the
infinite past as well. For each value of time $t$, $X_t$ or $X(t)$ is a
random variable. Both notations are used, but $X_t$ or $X_n$ is more
common for discrete time processes whereas $X(t)$ is more common for
continuous time processes. It is useful to recall that random
variables are functions on an underlying sample space $\Omega$ and hence
implicitly depend on $\omega \in \Omega$. Thus a random process (and a
random vector) is actually a function of two arguments, written
explicitly as $X(t, \omega)$; $t \in \mathcal{T}$, $\omega \in \Omega$ (or
$X_t(\omega)$; we use the first notation for the moment). Observe that
for a fixed value of time, $X(t, \omega)$ is a random variable whose value
depends probabilistically on $\omega$. On the other hand, if we fix
$\omega$ and allow $t$ to vary deterministically, we have either a
sequence ($\mathcal{T}$ discrete) or a waveform ($\mathcal{T}$ continuous).
If we fix both $t$ and $\omega$, we have a number. Overall we can consider
a random process as a two-space mapping
$X : \Omega \times \mathcal{T} \to \Re$ or as a one-space mapping
$X : \Omega \to \Re^{\mathcal{T}}$ from sample space into a space of
sequences or waveforms.
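
The two ways of reading $X(t, \omega)$ can be sketched in Python. The particular process used here, the $t$-th binary digit of a sample point $\omega \in [0, 1)$, is a hypothetical example chosen for illustration, and the function names are assumptions of this sketch.

```python
def x(t, omega):
    """X(t, omega): the t-th binary digit of omega in [0, 1), for t = 0, 1, 2, ..."""
    return int(omega * 2 ** (t + 1)) % 2

def sample_path(omega, times):
    """Fix omega and let t vary: a deterministic sequence."""
    return [x(t, omega) for t in times]

def random_variable_at(t):
    """Fix t and leave omega free: a random variable, i.e. a function of omega."""
    return lambda omega: x(t, omega)

path = sample_path(0.625, range(4))  # 0.625 is 0.1010 in binary
x2 = random_variable_at(2)           # the random variable X_2
print(path, x2(0.625))
```

Fixing `omega` produces a list (a sequence), fixing `t` produces a function of `omega` (a random variable), and fixing both produces a single number.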

There is a common notational ambiguity and hence confusion when
dealing with random processes. It is the same problem we encountered
with functions in the context of random variables at the beginning of
the chapter. The notation $X(t)$ or $X_t$ usually means a sample of the
random process at a specified time $t$, i.e., a random variable, just as
$\sin t$ means the sine of a specified value $t$. Often in the
literature, however, the notation is used as an abbreviation for
$\{X(t);\ t \in \mathcal{T}\}$ or $\{X_t;\ t \in \mathcal{T}\}$, that is, for the
entire random process or family of random variables. The abbreviation
is the same as the common use of $\sin t$ to mean
$\{\sin t;\ t \in (-\infty, \infty)\}$, that is, the entire waveform and not
just a single value. In summary, the common (and sometimes
unfortunate) ambiguity is in whether or not the dummy variable $t$
means a specific value or is implicitly allowed to vary over its
entire domain. Of course, as noted at the beginning of the chapter,
the problem could be avoided by reserving a different notation to
specify a fixed time value, say $t_0$, but this is usually not done to
avoid a proliferation of notation. In this book we will attempt to
avoid the potential confusion by using the abbreviations $\{X(t)\}$ and
$\{X_t\}$ for the random processes when the index set is clear from
context and reserving the notation $X(t)$ and $X_t$ to mean the $t$th
random variable of the process, that is, the sample of the random
process at time $t$. The reader should beware in reading other sources,
however, because this sloppiness will undoubtedly be encountered at
some point in the literature; when this happens one can only hope that
the context will make the meaning clear.

There is also an ambiguity regarding the alphabet of the random
process. If $X(t)$ takes values in $A_t$, then strictly speaking the
alphabet of the random process is $\prod_{t \in \mathcal{T}} A_t$, the space
of all possible waveforms or sequences with coordinates taking values
in $A_t$. If all of the $A_t$ are the same, say $A_t = A$, this process
alphabet is $A^{\mathcal{T}}$. In this case, however, the alphabet of the
process is commonly said to be simply $A$, the set of values from which
all of the coordinate random variables are drawn. We will frequently
use this convention.



3.5 Distributions of Random Vectors 

Since a random vector takes values in a space $\Re^k$, one might expect
that the events in this space, that is, the members of the event space
$\mathcal{B}(\Re)^k$, should inherit a probability measure from the original
probability space. This is in fact true as one would expect by analogy
to scalar random variables. Also analogous to the case of a random
variable, the probability measure is called a distribution and is
defined as

$$ P_{\mathbf{X}}(F) = P(\mathbf{X}^{-1}(F)) = P(\{\omega : \mathbf{X}(\omega) \in F\}) = P(\{\omega : (X_0(\omega), X_1(\omega), \ldots, X_{k-1}(\omega)) \in F\}) , \quad F \in \mathcal{B}(\Re)^k , $$

where the various forms are equivalent and all stand for $\Pr(\mathbf{X} \in F)$.
Equation (3.42) is the vector generalization of the inverse image equa- 
tion (3.22) for random variables. Hence (3.42) is the fundamental 
formula for deriving vector distributions, that is, probability distri- 
butions describing random vector events. Keep in mind that the 
random vectors might be composed of a collection of samples from a 
random process. 

By definition the distribution given by (3.22) is valid for each
component random variable. This does not immediately imply, however,
that the distribution given by (3.42) for events on all components
together is valid. As in the case of a random variable, the
distribution will be valid if the output events
$F \in \mathcal{B}(\Re)^k$ have inverse images under $\mathbf{X}$ that are input
events, that is, if $\mathbf{X}^{-1}(F) \in \mathcal{F}$ for every
$F \in \mathcal{B}(\Re)^k$. The following subsection treats this subtle issue
in further detail, but the only crucial point for our purposes is the
following. Given that we consider real-valued vectors
$\mathbf{X} = (X_0, X_1, \ldots, X_{k-1})$, knowing that each coordinate $X_i$
is a random variable (i.e., $X_i^{-1}(F) \in \mathcal{F}$ for each real event
$F$) guarantees that $\mathbf{X}^{-1}(F) \in \mathcal{F}$ for every
$F \in \mathcal{B}(\Re)^k$ and hence the basic derived distribution formula is
valid for random vectors.



3.5.1 ★ Multidimensional Events 

From the discussion following example [2.11] we can at least resolve 
the issue for certain types of output events, viz., events that are 
rectangles. Rectangles are special events in that the values assumed 
by any component in the event are not constrained by any of the other 
components (compare a two-dimensional rectangle with a circle, as 
in problem 2.36). Specifically $F \in \mathcal{B}(\Re)^k$ is a rectangle if it
has the form

$$ F = \{\mathbf{x} : x_i \in F_i;\ i = 0, 1, \ldots, k-1\} = \bigcap_{i=0}^{k-1} \{\mathbf{x} : x_i \in F_i\} = \prod_{i=0}^{k-1} F_i , $$

where all $F_i \in \mathcal{B}(\Re)$; $i = 0, 1, \ldots, k-1$ (refer to Figure 2.3(d)
for a two-dimensional illustration of such a rectangle). Because
inverse images preserve set operations (A.12), the inverse image of $F$
can be specified as the intersection of the inverse images of the
individual events:

$$ \mathbf{X}^{-1}(F) = \{\omega : X_i(\omega) \in F_i;\ i = 0, 1, \ldots, k-1\} = \bigcap_{i=0}^{k-1} X_i^{-1}(F_i) . $$

Since the $X_i$ are each random variables, the inverse images of the
individual events $X_i^{-1}(F_i)$ must all be in $\mathcal{F}$. Since
$\mathcal{F}$ is an event space, the intersection of events must also be an
event, and hence $\mathbf{X}^{-1}(F)$ is indeed an event.

Thus we conclude that the distribution is well defined for
rectangles. As to more general output events, we simply observe that
a result from measure theory ensures that if (1) inverse images of
rectangles are events and (2) rectangles are used to generate the
output event space, then the inverse images of all output events are
events. These two conditions are satisfied by our definition. Thus
the distribution of the random vector $\mathbf{X}$ is well defined. Although
a detailed proof of the measure theoretic result will not be given,
the essential concept can be given: Any output event can be
approximated arbitrarily closely by finite unions of rectangles (e.g.,
a circle can be approximated by lots of very small squares). The union
of the rectangles is an event. Finally, the limit of the events as the
approximation gets better must also be an event.



3.5.2 Multidimensional Probability Functions 

Given a probability space $(\Omega, \mathcal{F}, P)$ and a random vector
$\mathbf{X} : \Omega \to \Re^k$, we have seen that there is a probability
measure $P_{\mathbf{X}}$ that the random vector inherits from the original
probability space. With the new probability measure we define a new
probability space $(\Re^k, \mathcal{B}(\Re)^k, P_{\mathbf{X}})$. As in the scalar
case, the distribution can be described by probability functions, that
is, cdf’s and either pmf’s or pdf’s (or both). If the random vector has
a discrete range space, then the distribution can be described by a
multidimensional pmf
$p_{\mathbf{X}}(\mathbf{x}) = P_{\mathbf{X}}(\{\mathbf{x}\}) = \Pr(\mathbf{X} = \mathbf{x})$ as

$$ P_{\mathbf{X}}(F) = \sum_{\mathbf{x} \in F} p_{\mathbf{X}}(\mathbf{x}) = \sum_{(x_0, x_1, \ldots, x_{k-1}) \in F} p_{X_0, X_1, \ldots, X_{k-1}}(x_0, x_1, \ldots, x_{k-1}) , $$

where the last form points out the economy of the vector notation
of the previous line. If the random vector $\mathbf{X}$ has a continuous
range space, then in a similar fashion its distribution can be
described by a multidimensional pdf $f_{\mathbf{X}}$ with
$P_{\mathbf{X}}(F) = \int_F f_{\mathbf{X}}(\mathbf{x})\, d\mathbf{x}$. In order to
derive the pdf from the distribution, as in the scalar case, we use a
cdf.

Given a $k$-dimensional random vector $\mathbf{X}$, define its cumulative
distribution function $F_{\mathbf{X}}$ by

$$ F_{\mathbf{X}}(\boldsymbol{\alpha}) = F_{X_0, X_1, \ldots, X_{k-1}}(\alpha_0, \alpha_1, \ldots, \alpha_{k-1}) = P_{\mathbf{X}}(\{\mathbf{x} : x_i \le \alpha_i;\ i = 0, 1, \ldots, k-1\}) . $$

In English, $F_{\mathbf{X}}(\mathbf{x}) = \Pr(X_i \le x_i;\ i = 0, 1, \ldots, k-1)$. Note
that the cdf for any value of its argument is the probability of a
special kind of rectangle. For example, if we have a two-dimensional
random vector $(X, Y)$, then the cdf
$F_{X,Y}(\alpha, \beta) = \Pr(X \le \alpha, Y \le \beta)$ is the
probability of the semi-infinite rectangle $\{(x, y) : x \le \alpha,\ y \le \beta\}$.
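
For a discrete random vector the cdf can be computed directly as the pmf mass inside the semi-infinite rectangle. The Python sketch below does this for two dimensions; the dictionary representation of the pmf and the numerical values in `example_pmf` are hypothetical, chosen only for illustration.

```python
def joint_cdf(joint_pmf, alpha, beta):
    """F_{X,Y}(alpha, beta): total pmf mass in {(x, y): x <= alpha, y <= beta}."""
    return sum(p for (x, y), p in joint_pmf.items() if x <= alpha and y <= beta)

# Hypothetical joint pmf on {0, 1}^2, keyed by the pair (x, y).
example_pmf = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

print(joint_cdf(example_pmf, 0, 0))   # mass at (0, 0) only
print(joint_cdf(example_pmf, 0, 1))   # mass at (0, 0) and (0, 1)
print(joint_cdf(example_pmf, 1, 1))   # all of the mass
```

Arguments below the smallest sample value give an empty rectangle and hence a cdf of zero.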

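Observe that we can also write this probability in several other
ways, e.g.,

$$ F_{\mathbf{X}}(\mathbf{x}) = P_{\mathbf{X}}\left( \prod_{i=0}^{k-1} (-\infty, x_i] \right) = P(\{\omega : X_i(\omega) \le x_i;\ i = 0, 1, \ldots, k-1\}) = P\left( \bigcap_{i=0}^{k-1} X_i^{-1}((-\infty, x_i]) \right) . $$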






Since integration and differentiation are inverses of each other, it 
follows that 



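$$ f_{X_0, X_1, \ldots, X_{k-1}}(x_0, x_1, \ldots, x_{k-1}) = \frac{\partial^k}{\partial x_0\, \partial x_1 \cdots \partial x_{k-1}} F_{X_0, X_1, \ldots, X_{k-1}}(x_0, x_1, \ldots, x_{k-1}) . $$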



As with random variables, random vectors can, in general, have
discrete and continuous parts with a corresponding mixture
distribution. We will concentrate on random vectors that are described
completely by either pmf’s or pdf’s. Also as with random variables,
we can always unify notation using a multidimensional Stieltjes
integral to write

$$ P_{\mathbf{X}}(F) = \int_F dF_{\mathbf{X}}(\mathbf{x}) , $$

where the integral is defined as the usual integral if $\mathbf{X}$ is
described by a pdf, as a sum if $\mathbf{X}$ is described by a pmf, and by a
weighted average if $\mathbf{X}$ has both a discrete and a continuous part.
Random vectors are said to be continuous, discrete, or mixture random
vectors in accordance with the above analogy to random variables.



3.5.3 Consistency of Joint and Marginal Distributions

By definition a random vector $\mathbf{X} = (X_0, X_1, \ldots, X_{k-1})$ is a
collection of random variables defined on a common probability space
$(\Omega, \mathcal{F}, P)$. Alternatively, $\mathbf{X}$ can be considered to be
a random vector that takes on values randomly as described by a
probability distribution $P_{\mathbf{X}}$, without explicit reference to the
underlying probability space. Either the original probability measure
$P$ or the induced distribution $P_{\mathbf{X}}$ can be used to compute
probabilities of events involving the random vector. $P_{\mathbf{X}}$ in turn
may be induced by a pmf $p_{\mathbf{X}}$ or a pdf $f_{\mathbf{X}}$. From any of
these probabilistic descriptions we can find a probabilistic
description for any of the component random variables or any
collection thereof. That is, $P_{\mathbf{X}}$ is evaluated on rectangles of
the form $\{\mathbf{x} = (x_0, \ldots, x_{k-1}) : x_i \in G\}$ for any
$G \in \mathcal{B}(\Re)$ as

$$ P_{X_i}(G) = P_{\mathbf{X}}(\{\mathbf{x} : x_i \in G\}) , \quad G \in \mathcal{B}(\Re) . \tag{3.42} $$

For example, given a value of $i$ in $\{0, 1, \ldots, k-1\}$, the distribution
of the random variable $X_i$ is found by evaluating the distribution
$P_{\mathbf{X}}$ for the random vector on one-dimensional rectangles where
only the component $X_i$ is constrained to lie in some set; the rest of
the components can take on any value. Of course the probability can
also be evaluated using the underlying probability measure $P$ via
the usual formula

$$ P_{X_i}(G) = P(X_i^{-1}(G)) . $$

Alternatively, we can consider this a derived distribution problem
on the vector probability space $(\Re^k, \mathcal{B}(\Re)^k, P_{\mathbf{X}})$ using
a sampling function $\Pi_i : \Re^k \to \Re$ as in example [3.4].
Specifically, let $\Pi_i(\mathbf{X}) = X_i$. Using (3.22) we write

$$ P_{\Pi_i}(G) = P_{\mathbf{X}}(\Pi_i^{-1}(G)) = P_{\mathbf{X}}(\{\mathbf{x} : x_i \in G\}) . \tag{3.43} $$

The two formulas (3.42) and (3.43) demonstrate that $\Pi_i$ and $X_i$ are
equivalent random variables, and indeed they correspond to the same
physical events: the outputs of the $i$th coordinate of the random
vector $\mathbf{X}$. They are related through the formula
$\Pi_i(\mathbf{X}(\omega)) = X_i(\omega)$. Intuitively, the two random
variables provide different models of the same thing. As usual, which
is “better” depends on which is the simpler model to handle for a
given problem.

Another fundamental observation implicit in these ruminations is
that there are many ways to compute the probability of a given
event such as “the $i$th coordinate of the random vector $\mathbf{X}$ takes
on a value in an event $F$,” and all these methods must yield the same
answer because they all can be referred back to a common definition
in terms of the underlying probability measure $P$. This is called
consistency; the various probability measures ($P$, $P_{X_i}$, and
$P_{\mathbf{X}}$) are all consistent in that they assign the same number to
any given physical event for which they all are defined. In
particular, if we have a random process $\{X_t;\ t \in \mathcal{T}\}$, then
there is an infinite number of ways we could form a random vector
$(X_{t_0}, X_{t_1}, \ldots, X_{t_{k-1}})$ by choosing a finite number $k$ and
sample times $t_0, t_1, \ldots, t_{k-1}$ and each of these would result in a
corresponding $k$-dimensional probability distribution
$P_{X_{t_0}, X_{t_1}, \ldots, X_{t_{k-1}}}$. The calculus derived from the
axioms of probability implies that all of these distributions must be
consistent in the same sense, i.e., all must yield the same answer
when used to compute the probability of a given event.

The distribution $P_{X_i}$ of a single component $X_i$ of a random vector
$\mathbf{X}$ is referred to as a marginal distribution, while the
distribution $P_{\mathbf{X}}$ of the random vector is called a joint
distribution. As we have seen, joint and marginal distributions are
related by consistency with respect to the original probability
measure, i.e.,

$$ P_{X_i}(G) = P_{\mathbf{X}}(\{\mathbf{x} : x_i \in G\}) = P(\{\omega : X_i(\omega) \in G\}) = \Pr(X_i \in G) . \tag{3.44} $$

For the cases where the distributions are induced by pmf’s
(marginal pmf’s and joint pmf’s) or pdf’s (marginal pdf’s or joint
pdf’s), the relation becomes, respectively,

$$ p_{X_i}(\alpha) = \sum_{x_0, \ldots, x_{i-1}, x_{i+1}, \ldots, x_{k-1}} p_{X_0, \ldots, X_{k-1}}(x_0, \ldots, x_{i-1}, \alpha, x_{i+1}, \ldots, x_{k-1}) $$

or

$$ f_{X_i}(\alpha) = \int dx_0 \cdots dx_{i-1}\, dx_{i+1} \cdots dx_{k-1}\; f_{X_0, \ldots, X_{k-1}}(x_0, \ldots, x_{i-1}, \alpha, x_{i+1}, \ldots, x_{k-1}) . $$

That is, one sums or integrates over all of the dummy variables cor- 
responding to the unwanted random variables in the vector to obtain 
the pmf or pdf for the random variable $X_i$. The two formulas look
identical except that one sums for discrete random variables and the 
other integrates for continuous ones. We repeat the fact that both 
formulas are simple consequences of (3.44). 
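
The summation over unwanted dummy variables is easy to express in code. The following Python sketch stores a joint pmf as a dictionary keyed by $k$-tuples and sums out every coordinate not requested, which covers both a single component and a sub-vector. The three-dimensional example (two fair bits and their exclusive-or) is hypothetical and used only for illustration.

```python
from collections import defaultdict

def subvector_pmf(joint_pmf, indices):
    """Sum the joint pmf over every coordinate not listed in `indices`."""
    result = defaultdict(float)
    for x, prob in joint_pmf.items():
        result[tuple(x[i] for i in indices)] += prob
    return dict(result)

def marginal_pmf(joint_pmf, i):
    """Marginal pmf of the single coordinate X_i."""
    return {key[0]: prob for key, prob in subvector_pmf(joint_pmf, (i,)).items()}

# Hypothetical example: X0 and X1 are fair bits and X2 = X0 XOR X1.
joint = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}

print(marginal_pmf(joint, 2))          # pmf of X2 alone
print(subvector_pmf(joint, (0, 2)))    # joint pmf of the pair (X0, X2)
```

The same helper gives the marginal of any one coordinate or of any collection of coordinates; only the list of retained indices changes.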

One can also use (3.42) to derive the cdf of $X_i$ by setting
$G = (-\infty, \alpha]$. The cdf is

$$ F_{X_i}(\alpha) = F_{\mathbf{X}}(\infty, \infty, \ldots, \infty, \alpha, \infty, \ldots, \infty) , $$

where the $\alpha$ appears in the $i$th position. This equation states that
$\Pr(X_i \le \alpha) = \Pr(X_i \le \alpha$ and $X_j \le \infty$, all $j \ne i)$.
The expressions for pmf’s and pdf’s also can be derived from the
expression for cdf’s.

The details of notation with k random variables can cloud the 
meaning of the relations we are discussing. Therefore we rewrite 
them for the special case of k = 2 to emphasize the essential form. 
Suppose that (X,Y) is a random vector. Then the marginal distri- 
bution of X is obtained from the joint distribution of X and Y by 
leaving Y unconstrained, i.e., as in equation (3.42): 

$$ P_X(F) = P_{X,Y}(\{(x, y) : x \in F\}) ; \quad F \in \mathcal{B}(\Re) . $$

Furthermore, the marginal cdf of $X$ is

$$ F_X(\alpha) = F_{X,Y}(\alpha, \infty) . $$

If the range space of the vector $(X, Y)$ is discrete, the marginal pmf
of $X$ is

$$ p_X(x) = \sum_y p_{X,Y}(x, y) . $$

If the range space of the vector $(X, Y)$ is continuous and the cdf is
differentiable so that $f_{X,Y}(x, y)$ exists, the marginal pdf of $X$ is

$$ f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy , $$

with similar expressions for the distribution and probability functions
for the random variable $Y$.

In summary, given a probabilistic description of a random vec- 
tor, we can always determine a probabilistic description for any of 
the component random variables of the random vector. This follows 
from the consistency of probability distributions derived from a com- 
mon underlying probability space. It is important to keep in mind 
that the opposite statement is not true. As considered in the intro- 
duction to this chapter, given all the marginal distributions of the 
component random variables, we cannot find the joint distribution 
of the random vector formed from the components unless we further 
constrain the problem. This is true because the marginal distribu- 
tions provide none of the information about the interrelationships of 
the components that is contained in the joint distribution. 

In a similar manner we can deduce the distributions or probability
functions of “sub-vectors” of a random vector, that is, if we have
the distribution for $\mathbf{X} = (X_0, X_1, \ldots, X_{k-1})$ and if $k$ is big
enough, we can find the distribution for the random vector
$(X_1, X_2)$ or the random vector $(X_5, X_{10}, X_{15})$, and so on. Writing
the general formulas in detail is, however, tedious and adds little
insight. The basic idea, however, is extremely important. One always
starts with a probability space $(\Omega, \mathcal{F}, P)$ from which one can
proceed in many ways to compute the probability of an event involving
any combination of random variables defined on the space. No matter
how one proceeds, however, the probability computed for a given event
must be the same. In other words, all joint and marginal probability
distributions for random variables on a common probability space
must be consistent since they all follow from the common underlying
probability measure. For example, after finding the distribution of
a random vector $\mathbf{X}$, the marginal distribution for the specific
component $X_i$ can be found from the joint distribution. This marginal
distribution must agree with the marginal distribution obtained for
$X_i$ directly from the probability space. As another possibility, one
might first find a distribution for a subvector containing $X_i$, say the
vector $\mathbf{Y} = (X_{i-1}, X_i, X_{i+1})$. This distribution can be used to
find the marginal distribution for $X_i$. All answers must be the same
since all can be expressed in the form $P(X^{-1}(F))$ using the original
probability space; that is, all must be consistent in the sense that
they agree with one another on events.

Examples: Marginals from Joint 

We now give examples of the computation of marginal probability 
functions from joint probability functions. 

[3.15] Say that we are given a pair of random variables $X$ and $Y$
such that the random vector $(X, Y)$ has a pmf of the form

$$ p_{X,Y}(x, y) = r(x) q(y) , $$

where $r$ and $q$ are both valid pmf’s. In other words, $p_{X,Y}$ is a
product pmf. Then it is easily seen that

$$ p_X(x) = \sum_y p_{X,Y}(x, y) = \sum_y r(x) q(y) = r(x) \sum_y q(y) = r(x) . $$
Thus in the special case of a product distribution, knowing the 
marginal pmf’s is enough to know the joint distribution. 

[3.16] Consider flipping two fair coins connected by a piece of rub- 
ber that is fairly flexible. Unlike the example where the coins 
were soldered together, it is not certain that they will show the 
same face; it is, however, more probable. To quantify the pmf, 
say that the probability of the pair (0,0) is .4, the probability of 
the pair (1,1) is .4, and the probabilities of the pairs (0,1) and 
(1,0) are each .1. As with the soldered-coins case, this is clearly 
not a product distribution, but a simple computation shows that 
as in example [3.15], px and py both place probability 1/2 on 
0, and 1. Thus this distribution, the soldered-coins distribution, 
and the product distribution of example [3.15] all yield the same 
marginal pmf’s! The point again is that the marginal probability
functions are not enough to describe a vector experiment. We
need the joint probability function to describe the interrelations 
or dependencies among the random variables. 
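
The claim that different joint pmf’s can share the same marginals is easy to check numerically. In the Python sketch below the rubber-connected coins use the probabilities given in example [3.16]; the soldered-coins pmf (mass 1/2 on each of the pairs (0,0) and (1,1)) and the fair product pmf are assumptions made here to match the description, since their values are not restated in this example.

```python
def marginals(joint_pmf):
    """Return (p_X, p_Y) obtained by summing a two-dimensional joint pmf."""
    p_x, p_y = {}, {}
    for (x, y), prob in joint_pmf.items():
        p_x[x] = p_x.get(x, 0.0) + prob
        p_y[y] = p_y.get(y, 0.0) + prob
    return p_x, p_y

# Fair product pmf (assumed): every pair has probability 1/4.
product_coins = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
# Soldered coins (assumed values): the two faces always agree.
soldered_coins = {(0, 0): 0.5, (1, 1): 0.5}
# Rubber-connected coins, with the probabilities of example [3.16].
rubber_coins = {(0, 0): 0.4, (1, 1): 0.4, (0, 1): 0.1, (1, 0): 0.1}

for name, joint in [("product", product_coins),
                    ("soldered", soldered_coins),
                    ("rubber", rubber_coins)]:
    print(name, marginals(joint))
```

All three joint pmf’s are visibly different dictionaries, yet each call to `marginals` reports probability 1/2 on 0 and on 1 for both coordinates.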

[3.17] A gambler has a pair of very special dice: the sum of the 
two dice comes up as seven on every roll. Each die has six faces 
with values in A = {1, 2, 3, 4, 5, 6}. All combinations have equal 
probability; e.g., the probability of a one and a six has the same 
probability as a three and a four. Although the two dice are 
identical, we will distinguish between them by number for the 
purposes of assigning two random variables. The outcome of the 
roll of the first die is denoted X and the outcome of the roll of the 
second die is called Y so that (X,Y) is a random vector taking 
values in $A^2$, the space of all pairs of numbers drawn from $A$.
The joint pmf of $X$ and $Y$ is

$$ p_{X,Y}(x, y) = \begin{cases} 1/6 , & x + y = 7,\ (x, y) \in A^2 \\ 0 , & \text{otherwise} . \end{cases} $$

The pmf of $X$ is determined by summing the pmf with respect to
$y$. However, for any given $x \in A$, the value of $Y$ is determined:
viz., $Y = 7 - X$. Therefore the pmf of $X$ is

$$ p_X(x) = 1/6 , \quad x \in A . $$

Note that this pmf is the same as one would derive for the roll of 
a single unbiased die! Note also that the pmf for Y is identical with 
that for X. Obviously, then, it is impossible to tell that the gambler 
is using unfair dice as a pair from looking at outcomes of the rolls of 
each die alone. The joint pmf cannot be deduced from the marginal 
pmf’s alone. 

[3.18] Let $(X, Y)$ be a random vector with a pdf that is constant
on the unit disk in the $XY$ plane; i.e.,

$$ f_{X,Y}(x, y) = C , \quad x^2 + y^2 \le 1 . $$

The constant $C$ is determined by the requirement that the pdf
integrate to 1; i.e.,

$$ \int\!\!\int_{x^2 + y^2 \le 1} C\, dx\, dy = 1 . $$

Since this integral is just the area of a circle multiplied by $C$, we
have immediately that $C = 1/\pi$. For the moment, however, we
leave the joint pdf in terms of $C$ and determine the pdf of $X$ in
terms of $C$ by integrating with respect to $y$:



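$$ f_X(x) = \int_{-\sqrt{1 - x^2}}^{+\sqrt{1 - x^2}} C\, dy = 2C\sqrt{1 - x^2} , \quad x^2 \le 1 . $$

Observe that we could now also find $C$ by a second integration:

$$ \int_{-1}^{+1} 2C\sqrt{1 - x^2}\, dx = \pi C = 1 , $$

or $C = \pi^{-1}$. Thus the pdf of $X$ is

$$ f_X(x) = 2\pi^{-1}\sqrt{1 - x^2} , \quad x^2 \le 1 . $$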



By symmetry Y has the same pdf. Note that the marginal pdf 
is not constant , even though the joint pdf is. Furthermore, it is 
obvious that it would be impossible to determine the joint density 
from the marginal pdf’s alone. 
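
The marginal of the uniform pdf on the unit disk can be checked numerically. The Python sketch below integrates the joint pdf over $y$ with a simple midpoint rule and compares the result with $2\pi^{-1}\sqrt{1 - x^2}$; the midpoint rule and the grid size `n` are choices made for this illustration, so the agreement is only approximate.

```python
import math

C = 1.0 / math.pi  # constant value of the joint pdf on the unit disk

def joint_pdf(x, y):
    """Uniform pdf on the unit disk, zero elsewhere."""
    return C if x * x + y * y <= 1.0 else 0.0

def marginal_closed_form(x):
    """f_X(x) = 2 C sqrt(1 - x^2) for x^2 <= 1, zero elsewhere."""
    return 2.0 * C * math.sqrt(1.0 - x * x) if x * x <= 1.0 else 0.0

def midpoint_integral(f, a, b, n=20000):
    """Midpoint-rule approximation to the integral of f over [a, b]."""
    step = (b - a) / n
    return step * sum(f(a + (k + 0.5) * step) for k in range(n))

def marginal_numeric(x):
    """Integrate y out of the joint pdf at a fixed x."""
    return midpoint_integral(lambda y: joint_pdf(x, y), -1.0, 1.0)

total = midpoint_integral(marginal_closed_form, -1.0, 1.0)
print(marginal_numeric(0.5), marginal_closed_form(0.5), total)
```

The numerically integrated marginal matches the closed form, the closed form integrates to approximately one, and evaluating `marginal_closed_form` at a few points shows that the marginal is not constant even though the joint pdf is.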

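[3.19] Consider the two-dimensional Gaussian pdf of example [2.17]
with $k = 2$, $m = (0, 0)^t$, and
$\Lambda = \{\lambda(i, j) : \lambda(1, 1) = \lambda(2, 2) = 1,\ \lambda(1, 2) = \lambda(2, 1) = \rho\}$.
Since the inverse matrix is

$$ \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}^{-1} = \frac{1}{1 - \rho^2} \begin{bmatrix} 1 & -\rho \\ -\rho & 1 \end{bmatrix} , $$

the joint pdf for the random vector $(X, Y)$ is

$$ f_{X,Y}(x, y) = \frac{\exp\left( -\frac{1}{2(1 - \rho^2)} (x^2 + y^2 - 2\rho x y) \right)}{2\pi\sqrt{1 - \rho^2}} , \quad (x, y) \in \Re^2 . $$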



$\rho$ is called the “correlation coefficient” between $X$ and $Y$ and
must satisfy $\rho^2 < 1$ for $\Lambda$ to be positive definite. To find the
pdf of $X$ we complete the square in the exponent so that

$$ f_{X,Y}(x, y) = \frac{\exp\left( -\frac{(y - \rho x)^2}{2(1 - \rho^2)} - \frac{x^2}{2} \right)}{2\pi\sqrt{1 - \rho^2}} = \frac{\exp\left( -\frac{(y - \rho x)^2}{2(1 - \rho^2)} \right)}{\sqrt{2\pi(1 - \rho^2)}} \cdot \frac{\exp\left( -\frac{x^2}{2} \right)}{\sqrt{2\pi}} . $$



The pdf of $X$ is determined by integrating with respect to $y$ on
$(-\infty, \infty)$. To perform this integration, refer to the form of the
one-dimensional Gaussian pdf with $m = \rho x$ (note that $x$ is fixed
while the integration is with respect to $y$) and $\sigma^2 = 1 - \rho^2$. The
first factor in the preceding equation has this form. Because the
one-dimensional pdf must integrate to one, the pdf of $X$ that
results from integrating $y$ out from the two-dimensional pdf is
also a one-dimensional Gaussian pdf; i.e.,

$$ f_X(x) = (2\pi)^{-1/2} e^{-x^2/2} . $$



As in examples [3.16], [3.17], and [3.18], Y has the same pdf as 
$X$. Note that by varying $\rho$ there is a whole family of joint Gaussian
pdf’s with the same marginal Gaussian pdf’s. 
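
The observation that every value of the correlation coefficient gives the same marginal can be checked numerically. The Python sketch below integrates $y$ out of the zero-mean, unit-variance two-dimensional Gaussian pdf with a midpoint rule and compares the result with the one-dimensional Gaussian pdf. The integration limits `y_max` and grid size `n` are assumptions of this sketch, chosen so that the truncated tails are negligible.

```python
import math

def joint_gaussian_pdf(x, y, rho):
    """Zero-mean, unit-variance two-dimensional Gaussian pdf with correlation rho."""
    q = (x * x + y * y - 2.0 * rho * x * y) / (1.0 - rho * rho)
    return math.exp(-q / 2.0) / (2.0 * math.pi * math.sqrt(1.0 - rho * rho))

def marginal_of_x(x, rho, y_max=12.0, n=6000):
    """Integrate y out of the joint pdf with a midpoint rule on [-y_max, y_max]."""
    step = 2.0 * y_max / n
    return step * sum(
        joint_gaussian_pdf(x, -y_max + (k + 0.5) * step, rho) for k in range(n)
    )

def standard_gaussian_pdf(x):
    """One-dimensional Gaussian pdf with mean 0 and variance 1."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

# The middle column should match the last column for every rho.
for rho in (-0.8, 0.0, 0.5):
    print(rho, marginal_of_x(1.0, rho), standard_gaussian_pdf(1.0))
```

Changing `rho` changes the joint pdf but leaves the numerically computed marginal unchanged, in agreement with the completed-square argument.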



3.6 Independent Random Variables 

In chapter 2 it was seen that events are independent if the proba- 
bility of a joint event can be written as a product of probabilities 
of individual events. The notion of independent events provides a 
corresponding notion of independent random variables and, as will 
be seen, results in random variables being independent if their joint 
distributions are product distributions. 

Two random variables $X$ and $Y$ defined on a probability space
are independent if the events $X^{-1}(F)$ and $Y^{-1}(G)$ are independent
for all $F$ and $G$ in $\mathcal{B}(\Re)$. A collection of random variables
$\{X_i;\ i = 0, 1, \ldots, k-1\}$ is said to be independent or mutually
independent if all collections of events of the form
$\{X_i^{-1}(F_i);\ i = 0, 1, \ldots, k-1\}$ are mutually independent for any
$F_i \in \mathcal{B}(\Re)$; $i = 0, 1, \ldots, k-1$.

Thus two random variables are independent if and only if their
output events correspond to independent input events. Translating
this statement into distributions yields the following:

Random variables $X$ and $Y$ are independent if and only if

$$ P_{X,Y}(F_1 \times F_2) = P_X(F_1) P_Y(F_2) , \quad \text{all } F_1, F_2 \in \mathcal{B}(\Re) . $$

Recall that $F_1 \times F_2$ is an alternate notation for
$\prod_{i=1}^{2} F_i$; we will frequently use the alternate notation when
the number of product events is small. Note that a product and not an
intersection is used here. The reader should be certain that this is
understood. The intersection is appropriate if we refer back to the
original $\omega$ events, that is, using the inverse image formula to
write this statement in terms of the underlying probability space
yields

$$ P(X^{-1}(F_1) \cap Y^{-1}(F_2)) = P(X^{-1}(F_1))\, P(Y^{-1}(F_2)) . $$

Random variables Xo, . . . , X&_i are independent or mutually in- 
dependent if and only if 

(k - 1 \ k—1 

Px 0 ,...,x k _Aii F i) = Il p ^y^ 

\i = 0 / z=0 

for all Fi G 5(5R); i = 0, 1, . . . , k — 1. 

The general form for distributions can be specialized to pmf’s,
pdf’s, and cdf’s as follows. Two discrete random variables $X$ and $Y$
are independent if and only if the joint pmf factors as

\[ p_{X,Y}(x,y) = p_X(x) p_Y(y) \quad \forall x, y. \]

A collection of discrete random variables $X_i; i = 0, 1, \ldots, k-1$ is
mutually independent if and only if the joint pmf factors as

\[ p_{X_0,\ldots,X_{k-1}}(x_0,\ldots,x_{k-1}) = \prod_{i=0}^{k-1} p_{X_i}(x_i); \quad \forall x_i. \]

Similarly, if the random variables are continuous and described by
pdf’s, then two random variables are independent if and only if the
joint pdf factors as

\[ f_{X,Y}(x,y) = f_X(x) f_Y(y); \quad \forall x, y \in \Re. \]

A collection of continuous random variables is independent if and
only if the joint pdf factors as

\[ f_{X_0,\ldots,X_{k-1}}(x_0,\ldots,x_{k-1}) = \prod_{i=0}^{k-1} f_{X_i}(x_i). \]

Two general random variables (discrete, continuous, or mixture)
are independent if and only if the joint cdf factors as

\[ F_{X,Y}(x,y) = F_X(x) F_Y(y); \quad \forall x, y \in \Re. \]

A collection of general random variables is independent if and only
if the joint cdf factors

\[ F_{X_0,\ldots,X_{k-1}}(x_0,\ldots,x_{k-1}) = \prod_{i=0}^{k-1} F_{X_i}(x_i); \quad \forall (x_0, x_1, \ldots, x_{k-1}) \in \Re^k. \]

We have separately stated the two-dimensional case because of 
its simplicity and common occurrence. The student should be able 
to prove the equivalence of the general distribution form and the 
pmf form. If one does not consider technical problems regarding 
the interchange of limits of integration, then the equivalence of the 
general form and the pdf form can also be proved. 
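
The pmf form of the independence condition is easy to test mechanically
for a finite alphabet. The following is a minimal sketch, not part of the
original development: the function names `marginals` and `is_independent`
and the two example pmf’s are assumptions made for illustration. Exact
rational arithmetic is used so that the factorization can be compared
with equality.

```python
from fractions import Fraction
from itertools import product

def marginals(joint):
    """Return (p_X, p_Y) as dicts from a joint pmf {(x, y): prob}."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0) + p
        p_y[y] = p_y.get(y, 0) + p
    return p_x, p_y

def is_independent(joint):
    """True iff p_{X,Y}(x, y) == p_X(x) * p_Y(y) for every pair (x, y)."""
    p_x, p_y = marginals(joint)
    return all(joint.get((x, y), 0) == p_x[x] * p_y[y]
               for x, y in product(p_x, p_y))

# Two fair coin flips: a product pmf, hence independent.
fair = {(x, y): Fraction(1, 4) for x in (0, 1) for y in (0, 1)}

# Same uniform marginals, but Y always equals X: not independent.
copy = {(0, 0): Fraction(1, 2), (1, 1): Fraction(1, 2)}

print(is_independent(fair))  # True
print(is_independent(copy))  # False
```

Both examples have identical marginals, which illustrates that the
marginals alone do not determine whether the joint pmf is a product pmf.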



IID Random Vectors 

A random vector is said to be independent, identically distributed or 
iid if the coordinate random variables are independent and identically 
distributed; that is, if 

• the distribution is a product distribution, i.e., it has the form

\[ P_{\mathbf{X}}\left(\prod_{i=0}^{k-1} F_i\right) = \prod_{i=0}^{k-1} P_{X_i}(F_i) \]

for all choices of $F_i \in \mathcal{B}(\Re), i = 0, 1, \ldots, k-1$, and
• if all the marginal distributions are the same (the random variables are all
equivalent), i.e., if there is a distribution $P_X$ such that $P_{X_i}(F) = P_X(F)$;
all $F \in \mathcal{B}(\Re)$ for all $i$.

For example, a random vector will have a product distribution if
it has a joint pdf or pmf that is a product pdf or pmf as described in
example [2.16]. The general property is easy to describe in terms of
probability functions. The random vector will be iid if it has a joint
pdf with the form

\[ f_{\mathbf{X}}(\mathbf{x}) = \prod_i f_X(x_i) \]

for some pdf $f_X$ defined on $\Re$ or if it has a joint pmf with the form

\[ p_{\mathbf{X}}(\mathbf{x}) = \prod_i p_X(x_i) \]

for some pmf $p_X$ defined on some discrete subset of the real line. Both
of these cases are included in the following statement: A random
vector will be iid if and only if its cdf has the form

\[ F_{\mathbf{X}}(\mathbf{x}) = \prod_i F_X(x_i) \]

for some cdf $F_X$.

Note that, in contrast with earlier examples, the specification 
“product distribution,” along with the marginal pdf’s or pmf’s or 
cdf’s, is sufficient to specify the joint distribution. 



3.7 Conditional Distributions 

The idea of conditional probability can be used to provide a general 
representation of a joint distribution as a product, but a more com- 
plicated product than arises with an iid vector. As one would hope, 
the complicated form reduces to the simpler form when the vector is 
iid. The individual terms of the product have useful interpretations. 

The use of conditional probabilities allows us to break up many 
problems in a convenient form and focus on the relations among ran- 
dom variables. Examples to be treated include statistical detection, 
statistical classification, and additive noise.

3.7.1 Discrete Conditional Distributions



We begin with the discrete alphabet case as elementary conditional 
probability suffices in this simple case. We can derive results that ap- 
pear similar for the continuous case, but nonelementary conditional 
probability will be required to interpret the results correctly. 

Begin with the simple case of a discrete random vector $(X, Y)$ with
alphabet $A_X \times A_Y$ described by a pmf $p_{X,Y}(x,y)$. Let $p_X$ and $p_Y$
denote the corresponding marginal pmf’s. Define for each $x \in A_X$
for which $p_X(x) > 0$ the conditional pmf $p_{Y|X}(y|x); y \in A_Y$ as the
elementary conditional probability of $Y = y$ given $X = x$, that is,

\[
\begin{aligned}
p_{Y|X}(y|x) &= P(Y = y \mid X = x) \\
&= \frac{P(\{\omega : Y(\omega) = y\} \cap \{\omega : X(\omega) = x\})}{P(\{\omega : X(\omega) = x\})} \\
&= \frac{p_{X,Y}(x,y)}{p_X(x)},
\end{aligned}
\tag{3.45}
\]

where we have assumed that $p_X(x) > 0$ for all suitable $x$ to avoid
dividing by 0. Thus a conditional pmf is just a special case of an
elementary conditional probability. For each $x$ a conditional pmf is
itself a pmf, since it is clearly nonnegative and sums to 1:

\[
\sum_{y \in A_Y} p_{Y|X}(y|x)
= \sum_{y \in A_Y} \frac{p_{X,Y}(x,y)}{p_X(x)}
= \frac{1}{p_X(x)} \sum_{y \in A_Y} p_{X,Y}(x,y)
= \frac{1}{p_X(x)} p_X(x) = 1.
\]

We can compute conditional probabilities by summing conditional
pmf’s, i.e.,

\[ P(Y \in F \mid X = x) = \sum_{y \in F} p_{Y|X}(y|x). \tag{3.46} \]

The joint probability can be expressed as a product as

\[ p_{X,Y}(x,y) = p_{Y|X}(y|x) p_X(x). \tag{3.47} \]



Unlike the independent case, the terms of the product do not each
depend on only a single independent variable. If $X$ and $Y$ are
independent, then $p_{Y|X}(y|x) = p_Y(y)$ and the joint pmf reduces to the
product of two marginals.

Given the conditional pmf $p_{Y|X}$ and the pmf $p_X$, the conditional
pmf with the roles of the two random variables reversed can be
computed from the joint and marginal pmf’s by

\[
p_{X|Y}(x|y) = \frac{p_{X,Y}(x,y)}{p_Y(y)}
= \frac{p_{Y|X}(y|x) p_X(x)}{\sum_u p_{Y|X}(y|u) p_X(u)},
\tag{3.48}
\]

a result often referred to as Bayes’ rule.
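
Bayes’ rule (3.48) can be evaluated directly from a table of conditional
pmf values. The following sketch is illustrative only: the function name
`bayes_reverse` and the numerical values (a binary $X$ with
$p_X(1) = 1/4$ observed through a channel that flips it with probability
$1/10$) are assumptions, not taken from the text.

```python
from fractions import Fraction

def bayes_reverse(p_y_given_x, p_x):
    """Return p_{X|Y} as {y: {x: prob}} from p_{Y|X} = {x: {y: prob}} and p_X."""
    ys = {y for cond in p_y_given_x.values() for y in cond}
    p_x_given_y = {}
    for y in ys:
        # Denominator of (3.48): p_Y(y) = sum_u p_{Y|X}(y|u) p_X(u).
        p_y = sum(p_y_given_x[u].get(y, 0) * p_x[u] for u in p_x)
        if p_y > 0:  # the conditional pmf is only defined where p_Y(y) > 0
            p_x_given_y[y] = {x: p_y_given_x[x].get(y, 0) * p_x[x] / p_y
                              for x in p_x}
    return p_x_given_y

# Hypothetical numbers: X = 1 with probability 1/4; Y flips X with probability 1/10.
p_x = {0: Fraction(3, 4), 1: Fraction(1, 4)}
flip = Fraction(1, 10)
p_y_given_x = {x: {x: 1 - flip, 1 - x: flip} for x in (0, 1)}

posterior = bayes_reverse(p_y_given_x, p_x)
print(posterior[1][1])  # 3/4
```

Here $p_Y(1) = (1/10)(3/4) + (9/10)(1/4) = 3/10$, so that
$p_{X|Y}(1|1) = (9/40)/(3/10) = 3/4$.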

The ideas of conditional pmf’s immediately extend to random
vectors. Suppose we have a random vector $(X_0, X_1, \ldots, X_{k-1})$ with a
pmf $p_{X_0,X_1,\ldots,X_{k-1}}$, then (provided none of the denominators are 0)
we can define for each $l = 1, 2, \ldots, k-1$ the conditional pmf’s

\[
p_{X_l|X_0,\ldots,X_{l-1}}(x_l|x_0,\ldots,x_{l-1})
= \frac{p_{X_0,\ldots,X_l}(x_0,\ldots,x_l)}{p_{X_0,\ldots,X_{l-1}}(x_0,\ldots,x_{l-1})}.
\tag{3.49}
\]

Then simple algebra leads to the chain rule for pmf’s:

\[
\begin{aligned}
p_{X_0,X_1,\ldots,X_{n-1}}(x_0,x_1,\ldots,x_{n-1})
&= \left(\frac{p_{X_0,X_1,\ldots,X_{n-1}}(x_0,x_1,\ldots,x_{n-1})}{p_{X_0,X_1,\ldots,X_{n-2}}(x_0,x_1,\ldots,x_{n-2})}\right)
p_{X_0,X_1,\ldots,X_{n-2}}(x_0,x_1,\ldots,x_{n-2}) \\
&\;\;\vdots \\
&= p_{X_0}(x_0) \prod_{l=1}^{n-1} \frac{p_{X_0,X_1,\ldots,X_l}(x_0,x_1,\ldots,x_l)}{p_{X_0,X_1,\ldots,X_{l-1}}(x_0,x_1,\ldots,x_{l-1})} \\
&= p_{X_0}(x_0) \prod_{l=1}^{n-1} p_{X_l|X_0,\ldots,X_{l-1}}(x_l|x_0,\ldots,x_{l-1}),
\end{aligned}
\tag{3.50}
\]

a product of conditional probabilities. This provides a general form 
of the iid product form and reduces to the iid product form if indeed 
the random variables are mutually independent. This formula plays 
an important role in characterizing memory in random vectors and 
processes. It can be used to construct joint pmf’s, and can be used 
to specify a random process. 
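
As an illustration of how the chain rule (3.50) can be used to construct
a joint pmf, the following sketch builds the pmf of a short binary
vector from an initial pmf and a family of conditional pmf’s. The names
`joint_from_chain`, `p0`, `cond`, and `stay`, and the particular
conditional pmf (each bit repeats its predecessor with probability 3/4),
are hypothetical choices made for this example; all coordinates are
assumed to share the alphabet of $X_0$.

```python
from fractions import Fraction
from itertools import product

def joint_from_chain(p0, cond, n):
    """Joint pmf of (X_0, ..., X_{n-1}) via the chain rule (3.50).

    p0[x0] is the pmf of X_0; cond(l, x_l, past) returns
    p_{X_l | X_0..X_{l-1}}(x_l | past) for the tuple `past`.
    """
    joint = {}
    for xs in product(sorted(p0), repeat=n):
        p = p0[xs[0]]
        for l in range(1, n):
            p *= cond(l, xs[l], xs[:l])
        joint[xs] = p
    return joint

# Hypothetical example with memory: each bit repeats the previous one w.p. 3/4.
stay = Fraction(3, 4)
p0 = {0: Fraction(1, 2), 1: Fraction(1, 2)}

def cond(l, x, past):
    return stay if x == past[-1] else 1 - stay

joint = joint_from_chain(p0, cond, 3)
print(sum(joint.values()))   # 1
print(joint[(0, 0, 0)])      # 9/32
```

If `cond` ignored `past` and returned the same pmf for every `l`, the
construction would reduce to the iid product form.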





3.7.2 Continuous Conditional Distributions 

The situation with continuous random vectors is more complicated
if rigor is required, but the mechanics are quite similar. Again begin
with the simple case of two random variables $X$ and $Y$ with a joint
distribution, now taken to be described by a pdf $f_{X,Y}$. We define
the conditional pdf as an exact analog to that for pmf’s:

\[ f_{Y|X}(y|x) \triangleq \frac{f_{X,Y}(x,y)}{f_X(x)}. \tag{3.51} \]

This looks the same as the pmf, but it is not the same because pmf’s
are probabilities and pdf’s are not. A conditional pmf is an
elementary conditional probability. A conditional pdf is not. It is also not
the same as the conditional pdf of example [2.19] as in that case
the conditioning event had nonzero probability. The conditional pdf
$f_{Y|X}$ can, however, be related to a probability in the same way an
ordinary pdf (and the conditional pdf of example [2.19]) can. An
ordinary pdf is a density of probability; it is integrated to compute a
probability. In the same way, a conditional pdf can be interpreted as
a density of conditional probability, something you integrate to get
a conditional probability. Now, however, the conditioning event can
have probability zero and this does not really fit into the previous
development of elementary conditional probability. Note that a
conditional pdf is indeed a pdf, a nonnegative function that integrates
to one. This follows from

\[
\int f_{Y|X}(y|x)\, dy = \int \frac{f_{X,Y}(x,y)}{f_X(x)}\, dy
= \frac{1}{f_X(x)} \int f_{X,Y}(x,y)\, dy
= \frac{1}{f_X(x)} f_X(x) = 1,
\]

provided we require that $f_X(x) > 0$ over the region of integration.

To be more specific, given a conditional pdf $f_{Y|X}$, we will make
a tentative definition of the (nonelementary) conditional probability
that $Y \in F$ given $X = x$ as

\[ P(Y \in F \mid X = x) = \int_F f_{Y|X}(y|x)\, dy. \tag{3.52} \]

Note the close resemblance to the elementary conditional probabil- 
ity formula in terms of conditional pmf’s of (3.46). For all practical 
purposes (and hence for virtually all of this book), this construc- 
tive definition of nonelementary conditional probability will suffice. 
Unfortunately it does not provide sufficient rigor to lead to a useful 
advanced theory. Section 3.17 discusses the problems and the cor- 
rect general definition in some depth, but it is not required for most 
applications. 

Via almost identical manipulations to the pmf case in (3.48),
conditional pdf’s satisfy a Bayes’ rule:

\[
f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)}
= \frac{f_{Y|X}(y|x) f_X(x)}{\int f_{Y|X}(y|u) f_X(u)\, du}.
\tag{3.53}
\]

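As a simple but informative example of a conditional pdf,
consider the generalization of Example [3.19] to the case of a
two-dimensional vector $U = (X, Y)$ with a Gaussian pdf having a mean
vector $(m_X, m_Y)^t$ and a covariance matrix

\[
\Lambda = \begin{bmatrix} \sigma_X^2 & \rho\sigma_X\sigma_Y \\ \rho\sigma_X\sigma_Y & \sigma_Y^2 \end{bmatrix},
\tag{3.54}
\]

where $\rho$ is the correlation coefficient of $X$ and $Y$. Straightforward
algebra yields

\[ \det(\Lambda) = \sigma_X^2 \sigma_Y^2 (1 - \rho^2) \tag{3.55} \]

\[
\Lambda^{-1} = \frac{1}{1-\rho^2}
\begin{bmatrix} \dfrac{1}{\sigma_X^2} & -\dfrac{\rho}{\sigma_X\sigma_Y} \\[2ex] -\dfrac{\rho}{\sigma_X\sigma_Y} & \dfrac{1}{\sigma_Y^2} \end{bmatrix}
\tag{3.56}
\]

so that the two-dimensional pdf becomes

\[
\begin{aligned}
f_{X,Y}(x,y) &= \frac{1}{2\pi\sqrt{\det\Lambda}}\,
e^{-\frac{1}{2}(x-m_X,\, y-m_Y)\Lambda^{-1}(x-m_X,\, y-m_Y)^t} \\
&= \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}}
\exp\left(-\frac{1}{2(1-\rho^2)}\left[\left(\frac{x-m_X}{\sigma_X}\right)^2
- 2\rho\frac{(x-m_X)(y-m_Y)}{\sigma_X\sigma_Y}
+ \left(\frac{y-m_Y}{\sigma_Y}\right)^2\right]\right).
\end{aligned}
\tag{3.57}
\]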

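A little algebra to rearrange the expression yields

\[
f_{X,Y}(x,y) =
\frac{\exp\left(-\frac{1}{2}\left(\frac{x-m_X}{\sigma_X}\right)^2\right)}{\sqrt{2\pi\sigma_X^2}}
\times
\frac{\exp\left(-\frac{1}{2}\frac{\left(y - m_Y - (\rho\sigma_Y/\sigma_X)(x-m_X)\right)^2}{(1-\rho^2)\sigma_Y^2}\right)}{\sqrt{2\pi\sigma_Y^2(1-\rho^2)}},
\tag{3.58}
\]

from which it follows immediately that the conditional pdf is

\[
f_{Y|X}(y|x) =
\frac{\exp\left(-\frac{1}{2}\frac{\left(y - m_Y - (\rho\sigma_Y/\sigma_X)(x-m_X)\right)^2}{(1-\rho^2)\sigma_Y^2}\right)}{\sqrt{2\pi\sigma_Y^2(1-\rho^2)}},
\tag{3.59}
\]

which is itself a Gaussian density with variance $\sigma_{Y|X}^2 = \sigma_Y^2(1-\rho^2)$
and mean $m_{Y|X} = m_Y + \rho(\sigma_Y/\sigma_X)(x - m_X)$. Integrating $y$ out
of the joint pdf then shows that as in Example [3.19] the marginal
pdf is also Gaussian:

\[
f_X(x) = \frac{1}{\sqrt{2\pi\sigma_X^2}}\, e^{-\frac{1}{2}\left(\frac{x-m_X}{\sigma_X}\right)^2}.
\tag{3.60}
\]

A similar argument shows that $f_Y(y)$ and $f_{X|Y}(x|y)$ are also
Gaussian pdf’s. Observe that if $X$ and $Y$ are jointly Gaussian, then
they are also both individually and conditionally Gaussian!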
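
The factorization behind (3.58) and (3.59) can be checked numerically by
dividing the joint pdf (3.57) by the marginal pdf (3.60) and comparing
the ratio with a Gaussian pdf having the stated conditional mean and
variance. The sketch below is only a sanity check: the function names
and the parameter values are assumptions chosen for illustration.

```python
import math

def gauss(t, mean, var):
    """One-dimensional Gaussian pdf N(mean, var) evaluated at t."""
    return math.exp(-0.5 * (t - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def joint_pdf(x, y, mx, my, sx, sy, rho):
    """Two-dimensional Gaussian pdf in the form of (3.57)."""
    u, v = (x - mx) / sx, (y - my) / sy
    q = (u * u - 2 * rho * u * v + v * v) / (1 - rho ** 2)
    return math.exp(-0.5 * q) / (2 * math.pi * sx * sy * math.sqrt(1 - rho ** 2))

# Hypothetical parameter values chosen only for the check.
mx, my, sx, sy, rho = 1.0, -2.0, 2.0, 0.5, 0.8

def max_gap():
    """Largest |joint/marginal - N(m_{Y|X}, var_{Y|X})| over a small grid."""
    gap = 0.0
    for x in (-1.0, 0.0, 2.5):
        cond_mean = my + rho * (sy / sx) * (x - mx)
        cond_var = sy ** 2 * (1 - rho ** 2)
        for y in (-3.0, -2.0, -1.2):
            ratio = joint_pdf(x, y, mx, my, sx, sy, rho) / gauss(x, mx, sx ** 2)
            gap = max(gap, abs(ratio - gauss(y, cond_mean, cond_var)))
    return gap

print(max_gap())  # a value at the level of floating-point rounding error
```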



A chain rule for pdf’s follows in exactly the same way as that for
pmf’s. Assuming $f_{X_0,X_1,\ldots,X_l}(x_0,x_1,\ldots,x_l) > 0$,

\[
\begin{aligned}
f_{X_0,X_1,\ldots,X_{n-1}}(x_0,x_1,\ldots,x_{n-1})
&= \frac{f_{X_0,X_1,\ldots,X_{n-1}}(x_0,x_1,\ldots,x_{n-1})}{f_{X_0,X_1,\ldots,X_{n-2}}(x_0,x_1,\ldots,x_{n-2})}
f_{X_0,X_1,\ldots,X_{n-2}}(x_0,x_1,\ldots,x_{n-2}) \\
&\;\;\vdots \\
&= f_{X_0}(x_0) \prod_{l=1}^{n-1} \frac{f_{X_0,X_1,\ldots,X_l}(x_0,x_1,\ldots,x_l)}{f_{X_0,X_1,\ldots,X_{l-1}}(x_0,x_1,\ldots,x_{l-1})} \\
&= f_{X_0}(x_0) \prod_{l=1}^{n-1} f_{X_l|X_0,\ldots,X_{l-1}}(x_l|x_0,\ldots,x_{l-1}).
\end{aligned}
\tag{3.61}
\]

3.8 Statistical Detection and Classification 

Consider a simple, but nonetheless very important, example of the
application of conditional probability mass functions describing
discrete random vectors. Suppose that $X$ is a binary random variable
described by a pmf $p_X$, with $p_X(1) = p$. $X$ is possibly one bit in
some data coming through a modem. You receive a random variable
$Y$, which is equal to $X$ with probability $1 - \epsilon$. In terms of a
conditional pmf this is

\[
p_{Y|X}(y|x) = \begin{cases} \epsilon & x \neq y \\ 1 - \epsilon & x = y. \end{cases}
\tag{3.62}
\]

This can be written in a simple form using the idea of modulo 2
(or mod 2) arithmetic which will often be useful when dealing with
binary variables. Modulo 2 arithmetic or the “Galois field of 2
elements” arithmetic consists of an operation $\oplus$ called modulo 2
addition defined on the binary alphabet $\{0, 1\}$ as follows:

\[ 0 \oplus 1 = 1 \oplus 0 = 1 \tag{3.63} \]

\[ 0 \oplus 0 = 1 \oplus 1 = 0. \tag{3.64} \]

The operation $\oplus$ corresponds to an “exclusive or” in logic; that is, it
produces a 1 if one or the other but not both of its arguments is 1.
An equivalent definition for the conditional pmf is

\[ p_{Y|X}(y|x) = \epsilon^{x \oplus y} (1 - \epsilon)^{1 - x \oplus y}. \tag{3.65} \]

For example, the channel over which the bit is being sent is noisy
in that the receiver occasionally makes an error. Suppose that it is
known that the probability of such an error is $\epsilon$. The error might be
very small on a good phone line, but it might be very large if an evil
hacker is trying to corrupt your data. Given the observed $Y$, what is
the best guess $\hat{X}(Y)$ of what is actually sent? In other words, what
is the best decision rule or detection rule for guessing the value of $X$
given the observed value of $Y$? A reasonable parameter for judging
the quality of an arbitrary rule $\hat{X}$ is the resulting probability of error

\[ P_e(\hat{X}) = \Pr(\hat{X}(Y) \neq X). \tag{3.66} \]

A decision rule is optimal if it yields the smallest possible probability
of error over all possible decision rules. A little probability
manipulation quickly yields the optimal decision rule. Instead of minimizing
the error probability, we maximize the probability of being correct:

\[
\begin{aligned}
\Pr(\hat{X} = X) = 1 - P_e(\hat{X})
&= \sum_{(x,y):\hat{X}(y) = x} p_{X,Y}(x,y) \\
&= \sum_{(x,y):\hat{X}(y) = x} p_{X|Y}(x|y) p_Y(y) \\
&= \sum_y p_Y(y) \left( \sum_{x:\hat{X}(y) = x} p_{X|Y}(x|y) \right) \\
&= \sum_y p_Y(y)\, p_{X|Y}(\hat{X}(y)|y).
\end{aligned}
\]

To maximize this sum, we want to maximize the terms within the
sum for each $y$. Clearly the maximum value of the conditional
probability, $p_{X|Y}(\hat{X}(y)|y) = \max_u p_{X|Y}(u|y)$, will be achieved if we define
the decision rule $\hat{X}(y)$ to be the value of $u$ achieving the maximum of
$p_{X|Y}(u|y)$ over $u$, that is, define $\hat{X}(y)$ to be $\arg\max_u p_{X|Y}(u|y)$ (also
denoted $\max_u^{-1} p_{X|Y}(u|y)$). In words: the optimal estimate of $X$ given
the observation $Y$ in the sense of minimizing the probability of error
is the most probable value of $X$ given the observation. This is called
the maximum a posteriori or MAP decision rule. In our binary example
with equally likely inputs it reduces to choosing $\hat{x} = y$ if $\epsilon < 1/2$
and $\hat{x} = 1 - y$ if $\epsilon > 1/2$.
If $\epsilon = 1/2$ you can give up and flip a coin or make an arbitrary
decision. (Why?) Thus in this case the minimum (optimal) error probability
over all possible rules is $\min(\epsilon, 1 - \epsilon)$.

The astute reader will notice that having introduced conditional
pmf’s $p_{Y|X}$, the example considered the alternative pmf $p_{X|Y}$. The
two are easily related by Bayes’ rule (3.48).
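
The MAP rule for the binary channel of (3.62) can be tabulated by brute
force. The following sketch is an illustration under stated assumptions:
the function names `map_rule` and `error_probability` are hypothetical,
and the numerical values of $p$ and $\epsilon$ are arbitrary. Since
$p_Y(y)$ is common to all candidates $x$, the rule maximizes
$p_{Y|X}(y|x)p_X(x)$ rather than $p_{X|Y}(x|y)$ itself.

```python
def map_rule(p, eps):
    """MAP decision table {y: x_hat} for a binary input with p_X(1) = p
    sent through a channel that flips the bit with probability eps."""
    prior = {0: 1 - p, 1: p}
    rule = {}
    for y in (0, 1):
        # p_{X|Y}(x|y) is proportional to p_{Y|X}(y|x) p_X(x).
        score = {x: (eps if x != y else 1 - eps) * prior[x] for x in (0, 1)}
        rule[y] = max(score, key=score.get)
    return rule

def error_probability(rule, p, eps):
    """P_e = sum over (x, y) with rule[y] != x of p_{Y|X}(y|x) p_X(x)."""
    prior = {0: 1 - p, 1: p}
    return sum((eps if x != y else 1 - eps) * prior[x]
               for x in (0, 1) for y in (0, 1) if rule[y] != x)

for eps in (0.1, 0.8):
    rule = map_rule(0.5, eps)
    print(eps, rule, error_probability(rule, 0.5, eps))
```

With equally likely inputs the table is $\hat{x} = y$ for
$\epsilon = 0.1$ and $\hat{x} = 1 - y$ for $\epsilon = 0.8$, with error
probabilities close to $\min(\epsilon, 1 - \epsilon)$. With a strongly
biased input, e.g. `map_rule(0.99, 0.4)`, the rule ignores the
observation and always guesses 1.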

A generalization of the simple binary detection problem provides
the typical form of a statistical classification system. Suppose that
Nature selects a “class” $H$, a random variable described by a pmf
$p_H(h)$, which is no longer assumed to be binary. Once the class is
selected, Nature then generates a random “observation” $X$ according to
a pmf $p_{X|H}$. For example, the class might be a medical condition and
the observations the results of blood pressure, the patient’s age, medical
history, and other information regarding the patient’s health.
Alternatively, the class might be an “input signal” put into a noisy channel
which has the observation $X$ as an “output signal.” The question
is: Given the observation $X = x$, what is the best guess $\hat{H}(x)$ of the
unseen class? If by “best” we adopt the criterion that the best guess
is the one that minimizes the error probability $P_e = \Pr(\hat{H}(X) \neq H)$,
then the optimal classifier is again the MAP rule $\arg\max_h p_{H|X}(h|x)$.
More generally we might assign a cost $C_{y,h}$ resulting if the true class
is $h$ and we guess $y$. Typically it is assumed that $C_{h,h} = 0$, that is,
the cost is zero if our guess is correct. (In fact it can be shown that
this assumption involves no real loss of generality.) Given a classifier
(classification rule, decision rule) $\hat{h}(x)$, the Bayes risk is then defined
as

\[ B(\hat{h}) = \sum_{x,h} C_{\hat{h}(x),h}\, p_{H,X}(h,x), \tag{3.67} \]

which reduces to the probability of error if the cost function is given
by

\[ C_{y,h} = 1 - \delta_{y,h}. \tag{3.68} \]

The optimal classifier in the sense of minimizing the Bayes risk is
then found by observing the inequality

\[
B(\hat{h}) = \sum_x p_X(x) \sum_h C_{\hat{h}(x),h}\, p_{H|X}(h|x)
\geq \sum_x p_X(x) \min_y \left( \sum_h C_{y,h}\, p_{H|X}(h|x) \right),
\]

a lower bound that is achieved by the classifier

\[
\hat{h}(x) = \arg\min_y \left( \sum_h C_{y,h}\, p_{H|X}(h|x) \right),
\tag{3.69}
\]

the minimum average Bayes risk classifier. This reduces to the MAP
detection rule when $C_{y,h} = 1 - \delta_{y,h}$.
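
A small numerical sketch may help fix the role of the cost function in
(3.69). Everything specific below is assumed for illustration: the
function name `bayes_classifier`, the two class labels, and the
probabilities and costs. Costs are indexed as `cost[guess][truth]`, and
the factor $1/p_X(x)$ is dropped because it does not change the
minimizing guess.

```python
def bayes_classifier(p_h, p_x_given_h, cost):
    """Return {x: h_hat} minimizing sum_h cost[y][h] * p_{H|X}(h|x) over guesses y.

    The common factor 1/p_X(x) is dropped since it does not change the argmin.
    """
    classes = list(p_h)
    observations = {x for h in classes for x in p_x_given_h[h]}
    rule = {}
    for x in observations:
        joint = {h: p_x_given_h[h].get(x, 0.0) * p_h[h] for h in classes}
        risk = {y: sum(cost[y][h] * joint[h] for h in classes) for y in classes}
        rule[x] = min(risk, key=risk.get)
    return rule

# Hypothetical two-class example: class "b" is rare but costly to miss.
p_h = {"a": 0.9, "b": 0.1}
p_x_given_h = {"a": {0: 0.8, 1: 0.2}, "b": {0: 0.3, 1: 0.7}}
zero_one = {"a": {"a": 0, "b": 1}, "b": {"a": 1, "b": 0}}   # cost[guess][truth]
skewed = {"a": {"a": 0, "b": 10}, "b": {"a": 1, "b": 0}}

map_table = bayes_classifier(p_h, p_x_given_h, zero_one)
risk_table = bayes_classifier(p_h, p_x_given_h, skewed)
print(map_table, risk_table)
```

With the 0-1 cost of (3.68) the classifier is the MAP rule and always
guesses the common class "a". Making a missed "b" ten times as costly
changes the decision for the observation $x = 1$.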



3.9 Additive Noise 

The next examples of the use of conditional distributions treat the 
distributions arising when one random variable (thought of as a 
“noise” term) is added to another, independent random variable 
(thought of as a “signal” term). This is an important example of 
a derived distribution problem that yields an interesting conditional 
probability. The problem also suggests a valuable new tool which will 
provide a simpler way of solving many similar derived distributions 
— the characteristic function of random variables. 



Discrete Additive Noise 

Consider two independent random variables X and W and form a 
new random variable Y = X + W. This could be a description of 
how errors are actually caused in a noisy communication channel 
connecting a binary information source to a user. In order to apply 
the detection and classification signal processing methods, we must 
first compute the appropriate conditional probabilities of the output 
$Y$ given the input $X$. To do this we begin by computing the joint
pmf of $X$ and $Y$ using the inverse image formula:

\[
\begin{aligned}
p_{X,Y}(x,y) &= \Pr(X = x, Y = y) = \Pr(X = x, X + W = y) \\
&= \sum_{\alpha,\beta:\, \alpha = x,\, \alpha + \beta = y} p_{X,W}(\alpha,\beta) = p_{X,W}(x, y - x) \\
&= p_X(x)\, p_W(y - x).
\end{aligned}
\tag{3.70}
\]

Note that this formula only makes sense if $y - x$ is one of the values
in the range space of $W$. Thus from the definition of conditional
pmf’s:

\[ p_{Y|X}(y|x) = \frac{p_{X,Y}(x,y)}{p_X(x)} = p_W(y - x), \tag{3.71} \]

an answer that should be intuitive: given the input is $x$, the output
will equal a certain value $y$ if and only if the noise exactly makes up
the difference, i.e., $W = y - x$. Note that the marginal pmf for the
output $Y$ can be found by summing the joint probability:

\[ p_Y(y) = \sum_x p_{X,Y}(x,y) = \sum_x p_X(x)\, p_W(y - x), \tag{3.72} \]

a formula that is known as a discrete convolution or convolution sum.

Anyone familiar with convolutions knows that they can be un- 
pleasant to evaluate, so we postpone further consideration to the 
next section and turn to the continuous analog. 
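
For small alphabets the convolution sum (3.72) is nonetheless easy to
evaluate by machine. The sketch below is illustrative: the function name
`convolve_pmfs` and the dice example are assumptions, not part of the
original text. The double loop visits every pair $(x, w)$ and adds
$p_X(x)p_W(w)$ to the entry for $y = x + w$, which is the same sum as
$\sum_x p_X(x)p_W(y - x)$.

```python
from fractions import Fraction

def convolve_pmfs(p_x, p_w):
    """pmf of Y = X + W for independent X and W, via the convolution sum (3.72)."""
    p_y = {}
    for x, px in p_x.items():
        for w, pw in p_w.items():
            p_y[x + w] = p_y.get(x + w, 0) + px * pw
    return p_y

# Hypothetical example: the sum of two independent fair dice.
die = {k: Fraction(1, 6) for k in range(1, 7)}
two_dice = convolve_pmfs(die, die)
print(two_dice[7])   # 1/6
```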

The above development assumed ordinary arithmetic, but it is
worth pointing out that for discrete random variables sometimes
other types of arithmetic are appropriate, e.g., modulo 2 arithmetic
for binary random variables. The binary example of section 3.8
can be considered as an additive noise example if we define
a random variable $W$ which is independent of $X$ and has a
pmf $p_W(w) = \epsilon^w (1 - \epsilon)^{1-w}; w = 0, 1$ and where $Y = X + W$ is
interpreted as modulo 2 arithmetic, that is, as $Y = X \oplus W$. This
additive noise definition is easily seen to yield the conditional pmf of
(3.62) and the output pmf via a convolution. To be precise,

\[
\begin{aligned}
p_{X,Y}(x,y) &= \Pr(X = x, Y = y) = \Pr(X = x, X \oplus W = y) \\
&= \sum_{\alpha,\beta:\, \alpha = x,\, \alpha \oplus \beta = y} p_{X,W}(\alpha,\beta) = p_{X,W}(x, y \oplus x) \\
&= p_X(x)\, p_W(y \oplus x)
\end{aligned}
\tag{3.73}
\]

and hence

\[ p_{Y|X}(y|x) = \frac{p_{X,Y}(x,y)}{p_X(x)} = p_W(y \oplus x) \tag{3.74} \]

and

\[ p_Y(y) = \sum_x p_{X,Y}(x,y) = \sum_x p_X(x)\, p_W(y \oplus x), \tag{3.75} \]

a modulo 2 convolution.
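
The modulo 2 convolution (3.75) has only two terms per output value and
can be written in one line, since modulo 2 addition of bits is the
exclusive-or operator `^` in Python. The function name `mod2_convolve`
and the values of $p$ and $\epsilon$ below are assumptions made for the
example.

```python
def mod2_convolve(p_x, p_w):
    """pmf of Y = X xor W for independent binary X and W, as in (3.75)."""
    return {y: sum(p_x[x] * p_w[y ^ x] for x in (0, 1)) for y in (0, 1)}

p, eps = 0.3, 0.1                    # hypothetical values
p_x = {0: 1 - p, 1: p}
p_w = {0: 1 - eps, 1: eps}           # p_W(w) = eps^w (1 - eps)^(1 - w)
p_y = mod2_convolve(p_x, p_w)
print(p_y)
```

For these values $p_Y(1) = (0.7)(0.1) + (0.3)(0.9) = 0.34$.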



Continuous Additive Noise 

An entirely analogous formula arises in the continuous case. Again
suppose that $X$ is a random variable, a signal, with pdf $f_X$, and
that $W$ is a random variable, the noise, with pdf $f_W$. The random
variables $X$ and $W$ are assumed to be independent. Form a new
random variable $Y = X + W$, an observed signal plus noise. The problem is to
find the conditional pdf’s $f_{Y|X}(y|x)$ and $f_{X|Y}(x|y)$. The operation of
producing an output $Y$ from an input signal $X$ is called an additive
noise channel in communications systems. The channel is completely
described by $f_{Y|X}$. The second pdf, $f_{X|Y}$, will prove useful later when
we try to estimate $X$ given an observed value of $Y$.

Independence of $X$ and $W$ implies that the joint pdf is
$f_{X,W}(x,w) = f_X(x) f_W(w)$. To find the joint pdf $f_{X,Y}$, first
evaluate the joint cdf and then take the appropriate derivative. The cdf
is a straightforward derived distribution calculation:

\[
\begin{aligned}
F_{X,Y}(x,y) &= \Pr(X \leq x, Y \leq y) \\
&= \Pr(X \leq x, X + W \leq y) \\
&= \iint_{\alpha,\beta:\, \alpha \leq x,\, \alpha + \beta \leq y} f_{X,W}(\alpha,\beta)\, d\alpha\, d\beta \\
&= \int_{-\infty}^{x} d\alpha \int_{-\infty}^{y-\alpha} d\beta\, f_X(\alpha) f_W(\beta) \\
&= \int_{-\infty}^{x} d\alpha\, f_X(\alpha) F_W(y - \alpha).
\end{aligned}
\]

Taking the derivatives yields

\[ f_{X,Y}(x,y) = f_X(x) f_W(y - x) \]

and hence

\[ f_{Y|X}(y|x) = f_W(y - x). \tag{3.76} \]

The marginal pdf for the sum $Y = X + W$ is then found as

\[ f_Y(y) = \int f_{X,Y}(x,y)\, dx = \int f_X(x) f_W(y - x)\, dx, \tag{3.77} \]

a convolution integral of the pdf’s fx and fw, analogous to the con- 
volution sum found when adding independent discrete random vari- 
ables. Thus the evaluation of the pdf of the sum of two independent 
continuous random variables is of the same form as the evaluation of 
the output of a linear system with an input signal fx and an impulse 
response $f_W$. We will later see an easy way to accomplish this using
transforms. The pdf $f_{X|Y}$ follows from Bayes’ rule:

\[
f_{X|Y}(x|y) = \frac{f_X(x) f_W(y - x)}{\int f_X(\alpha) f_W(y - \alpha)\, d\alpha}.
\tag{3.78}
\]

It is instructive to work through the details of the previous example
for the special case of Gaussian random variables. For simplicity the
means are assumed to be zero and hence it is assumed that $f_X$ is
$\mathcal{N}(0, \sigma_X^2)$, that $f_W$ is $\mathcal{N}(0, \sigma_W^2)$, and that as in the example $X$ and
$W$ are independent and $Y = X + W$. From (3.76)

\[
f_{Y|X}(y|x) = f_W(y - x) = \frac{e^{-\frac{(y-x)^2}{2\sigma_W^2}}}{\sqrt{2\pi\sigma_W^2}},
\tag{3.79}
\]

from which the conditional pdf can be immediately recognized as
being Gaussian with mean $x$ and variance $\sigma_W^2$, that is, as $\mathcal{N}(x, \sigma_W^2)$.

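To evaluate the pdf $f_{X|Y}$ using Bayes’ rule, we begin with the
denominator $f_Y$ of (3.53) and write

\[
f_Y(y) = \int_{-\infty}^{\infty} f_{Y|X}(y|\alpha) f_X(\alpha)\, d\alpha
= \int_{-\infty}^{\infty} \frac{e^{-\frac{(y-\alpha)^2}{2\sigma_W^2}}}{\sqrt{2\pi\sigma_W^2}}\,
\frac{e^{-\frac{\alpha^2}{2\sigma_X^2}}}{\sqrt{2\pi\sigma_X^2}}\, d\alpha
\tag{3.80}
\]

\[
\begin{aligned}
&= \frac{1}{2\pi\sigma_X\sigma_W} \int_{-\infty}^{\infty}
e^{-\frac{1}{2}\left[\frac{y^2 - 2\alpha y + \alpha^2}{\sigma_W^2} + \frac{\alpha^2}{\sigma_X^2}\right]}\, d\alpha \\
&= \frac{e^{-\frac{y^2}{2\sigma_W^2}}}{2\pi\sigma_X\sigma_W}
\left[ \int_{-\infty}^{\infty}
e^{-\frac{1}{2}\left[\alpha^2\left(\frac{1}{\sigma_X^2} + \frac{1}{\sigma_W^2}\right) - \frac{2\alpha y}{\sigma_W^2}\right]}\, d\alpha \right].
\end{aligned}
\tag{3.81}
\]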

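This convolution of two Gaussian “signals” can be accomplished
using an old trick called “complete the square.” Call the integral in
the square brackets at the end of the above equation $I$ and note that
the integrand resembles

\[ e^{-\frac{1}{2}\left(\frac{\alpha - m}{\sigma}\right)^2}. \]

We know from (B.13) in appendix B that this integrates to

\[ \int_{-\infty}^{\infty} e^{-\frac{1}{2}\left(\frac{\alpha - m}{\sigma}\right)^2}\, d\alpha = \sqrt{2\pi\sigma^2} \]

since a Gaussian pdf integrates to 1. The trick is to modify $I$ to
resemble this integral with an additional factor. Compare the two
exponents:

\[ -\frac{1}{2}\left[\alpha^2\left(\frac{1}{\sigma_X^2} + \frac{1}{\sigma_W^2}\right) - \frac{2\alpha y}{\sigma_W^2}\right] \]

vs.

\[
-\frac{1}{2}\left(\frac{\alpha - m}{\sigma}\right)^2
= -\frac{1}{2}\left[\frac{\alpha^2}{\sigma^2} - \frac{2\alpha m}{\sigma^2} + \frac{m^2}{\sigma^2}\right].
\]

The exponent from $I$ will equal the left two terms of the expanded
exponent in the known integral if we choose

\[ \frac{1}{\sigma^2} = \frac{1}{\sigma_X^2} + \frac{1}{\sigma_W^2} \]

or, equivalently,

\[ \sigma^2 = \frac{\sigma_X^2 \sigma_W^2}{\sigma_X^2 + \sigma_W^2}, \tag{3.83} \]

and if we choose

\[ \frac{y}{\sigma_W^2} = \frac{m}{\sigma^2} \]

or, equivalently,

\[ m = \frac{\sigma^2}{\sigma_W^2}\, y. \tag{3.84} \]

Using (3.83) - (3.84) we have that

\[
\alpha^2\left(\frac{1}{\sigma_X^2} + \frac{1}{\sigma_W^2}\right) - \frac{2\alpha y}{\sigma_W^2}
= \left(\frac{\alpha - m}{\sigma}\right)^2 - \frac{m^2}{\sigma^2},
\]

where the addition of the leftmost term is “completing the square.”
With this identification and again using (3.83) - (3.84) we have that

\[
I = \int_{-\infty}^{\infty} e^{-\frac{1}{2}\left[\left(\frac{\alpha - m}{\sigma}\right)^2 - \frac{m^2}{\sigma^2}\right]}\, d\alpha
= \sqrt{2\pi\sigma^2}\; e^{\frac{1}{2}\frac{m^2}{\sigma^2}},
\tag{3.85}
\]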



which implies that

\[
f_Y(y) = \frac{e^{-\frac{1}{2}\frac{y^2}{\sigma_W^2}}}{2\pi\sigma_X\sigma_W}
\sqrt{2\pi\sigma^2}\; e^{\frac{1}{2}\frac{m^2}{\sigma^2}}
= \frac{e^{-\frac{1}{2}\frac{y^2}{\sigma_X^2 + \sigma_W^2}}}{\sqrt{2\pi(\sigma_X^2 + \sigma_W^2)}}.
\]

In other words, $f_Y$ is $\mathcal{N}(0, \sigma_X^2 + \sigma_W^2)$ and we have shown that the
sum of two zero mean independent Gaussian random variables is
another zero mean Gaussian random variable with variance equal to
the sum of the variances of the two random variables being added.

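Finally we turn to the a posteriori probability $f_{X|Y}$. From Bayes’
rule and a lot of algebra

\[
\begin{aligned}
f_{X|Y}(x|y) &= \frac{f_{Y|X}(y|x) f_X(x)}{f_Y(y)} \\
&= \frac{\dfrac{e^{-\frac{1}{2}\frac{(y-x)^2}{\sigma_W^2}}}{\sqrt{2\pi\sigma_W^2}}\;
\dfrac{e^{-\frac{1}{2}\frac{x^2}{\sigma_X^2}}}{\sqrt{2\pi\sigma_X^2}}}
{\dfrac{e^{-\frac{1}{2}\frac{y^2}{\sigma_X^2 + \sigma_W^2}}}{\sqrt{2\pi(\sigma_X^2 + \sigma_W^2)}}} \\
&= \frac{1}{\sqrt{2\pi\frac{\sigma_X^2\sigma_W^2}{\sigma_X^2 + \sigma_W^2}}}\;
e^{-\frac{1}{2}\left[\frac{y^2 - 2yx + x^2}{\sigma_W^2} + \frac{x^2}{\sigma_X^2} - \frac{y^2}{\sigma_X^2 + \sigma_W^2}\right]} \\
&= \frac{1}{\sqrt{2\pi\frac{\sigma_X^2\sigma_W^2}{\sigma_X^2 + \sigma_W^2}}}\;
\exp\left(-\frac{1}{2}\,
\frac{\left(x - \frac{\sigma_X^2}{\sigma_X^2 + \sigma_W^2}\, y\right)^2}{\frac{\sigma_X^2\sigma_W^2}{\sigma_X^2 + \sigma_W^2}}\right).
\end{aligned}
\tag{3.86}
\]

In words: $f_{X|Y}(x|y)$ is a Gaussian pdf

\[
\mathcal{N}\left(\frac{\sigma_X^2}{\sigma_X^2 + \sigma_W^2}\, y,\; \frac{\sigma_X^2\sigma_W^2}{\sigma_X^2 + \sigma_W^2}\right).
\]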

The mean of a conditional distribution is called a conditional mean 
and the variance of a conditional distribution is called a conditional 
variance. 
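
The conditional mean and variance of the Gaussian example can be checked
against a direct evaluation of Bayes’ rule. The sketch below assumes
zero-mean $X$ and $W$ as in the example; the function names and the
values of the variances and of the observation $y$ are hypothetical.

```python
import math

def gauss(t, mean, var):
    """One-dimensional Gaussian pdf N(mean, var) evaluated at t."""
    return math.exp(-0.5 * (t - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def posterior_params(y, var_x, var_w):
    """Conditional mean and variance of X given Y = y for Y = X + W."""
    total = var_x + var_w
    return var_x / total * y, var_x * var_w / total

def posterior_by_bayes(x, y, var_x, var_w):
    """f_{X|Y}(x|y) computed directly as f_{Y|X}(y|x) f_X(x) / f_Y(y)."""
    return gauss(y, x, var_w) * gauss(x, 0.0, var_x) / gauss(y, 0.0, var_x + var_w)

var_x, var_w, y = 4.0, 1.0, 2.5        # hypothetical values
mean, var = posterior_params(y, var_x, var_w)
print(mean, var)
for x in (0.0, 1.5, 3.0):
    # The two columns should agree up to rounding error.
    print(posterior_by_bayes(x, y, var_x, var_w), gauss(x, mean, var))
```

Note that the conditional mean is the observation scaled by
$\sigma_X^2/(\sigma_X^2 + \sigma_W^2)$, which is close to 1 when the
noise variance is small compared with the signal variance.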




Continuous Additive Noise with Discrete Input 

Additive noise provides a situation in which mixed distributions hav- 
ing both discrete and continuous parts naturally arise. Suppose that 
the signal $X$ is binary, say with pmf $p_X(x) = p^x(1 - p)^{1-x}$. The noise
term $W$ is assumed to be a continuous random variable described by
pdf $f_W(w)$, independent of $X$, with variance $\sigma_W^2$. The observation
is defined by $Y = X + W$. In this case the joint distribution is not
defined by a joint pmf or a joint pdf, but by a combination of the
two. Some thought may lead to the reasonable guess that the
continuous observation given the discrete signal should be describable
by a conditional pdf $f_{Y|X}(y|x) = f_W(y - x)$, where the conditional
pdf is of the elementary variety since the given event has nonzero
probability. To prove that this is in fact correct, consider the elementary
conditional probability $\Pr(Y \leq y \mid X = x)$, for $x = 0, 1$. This is
recognizable as the conditional cdf for $Y$ given $X = x$, so that the desired
conditional density is given by

\[ f_{Y|X}(y|x) = \frac{d}{dy} \Pr(Y \leq y \mid X = x). \tag{3.87} \]

The required probability is evaluated using the independence of X 
and W as 

\[
\begin{aligned}
\Pr(Y \leq y \mid X = x) &= \Pr(X + W \leq y \mid X = x) = \Pr(x + W \leq y \mid X = x) \\
&= \Pr(W \leq y - x) = F_W(y - x).
\end{aligned}
\]

Differentiating with respect to $y$ gives

\[ f_{Y|X}(y|x) = f_W(y - x). \tag{3.88} \]



The joint distribution is described in this case by a combination
of a pmf and a pdf. For example, the joint probability
that $X \in F$ and $Y \in G$ is computed as

\[
\begin{aligned}
\Pr(X \in F \text{ and } Y \in G) &= \sum_{x \in F} p_X(x) \int_G f_{Y|X}(y|x)\, dy \\
&= \sum_{x \in F} p_X(x) \int_G f_W(y - x)\, dy.
\end{aligned}
\tag{3.89}
\]

Choosing $F = \Re$ yields the output distribution

\[
\Pr(Y \in G) = \sum_x p_X(x) \int_G f_{Y|X}(y|x)\, dy
= \sum_x p_X(x) \int_G f_W(y - x)\, dy.
\]

Choosing $G = (-\infty, y]$ provides a formula for the cdf $F_Y(y)$, which
can be differentiated to yield the output pdf

\[ f_Y(y) = \sum_x p_X(x) f_{Y|X}(y|x) = \sum_x p_X(x) f_W(y - x), \tag{3.90} \]

a mixed discrete convolution involving a pmf and a pdf (and
exactly the formula one expects in this mixed situation given the pure
discrete and continuous examples).

Continuing the parallel with the pure discrete and continuous 
cases, one might expect that Bayes’ rule could be used to evalu- 
ate the conditional distribution in the opposite direction. Since X is 
discrete this is the conditional pmf: 



\[
p_{X|Y}(x|y) = \frac{f_{Y|X}(y|x)\, p_X(x)}{f_Y(y)}
= \frac{f_{Y|X}(y|x)\, p_X(x)}{\sum_\alpha p_X(\alpha) f_{Y|X}(y|\alpha)}.
\tag{3.91}
\]



Observe that unlike previously treated conditional pmf’s, this one 
is not an elementary conditional probability since the conditioning 
event does not have nonzero probability. Thus it cannot be defined 
in the original manner, but must be justified in the same way as 
conditional pdf’s, that is, by rewriting the joint distribution (3.89) 
as 



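\[
\begin{aligned}
\Pr(X \in F \text{ and } Y \in G) &= \int_G dy\, f_Y(y) \Pr(X \in F \mid Y = y) \\
&= \int_G dy\, f_Y(y) \sum_{x \in F} p_{X|Y}(x|y),
\end{aligned}
\tag{3.92}
\]

so that $p_{X|Y}(x|y)$ indeed plays the role of a mass of conditional
probability, that is,

\[ \Pr(X \in F \mid Y = y) = \sum_{x \in F} p_{X|Y}(x|y). \tag{3.93} \]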



Applying these results to the specific case of the binary input and 
Gaussian noise, the conditional pmf of the binary input given the 
noisy observation is 



\[
p_{X|Y}(x|y) = \frac{f_W(y - x)\, p_X(x)}{f_Y(y)}
= \frac{f_W(y - x)\, p_X(x)}{\sum_\alpha p_X(\alpha) f_W(y - \alpha)};
\quad y \in \Re,\; x \in \{0, 1\}.
\tag{3.94}
\]
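
As a numerical illustration of (3.94), the following sketch evaluates the
conditional pmf of a binary input given a noisy observation. The noise
is taken to be zero-mean Gaussian, which is an assumption of the sketch,
and the function name `posterior_pmf` and the numerical values are
hypothetical.

```python
import math

def gauss(t, mean, var):
    """One-dimensional Gaussian pdf N(mean, var) evaluated at t."""
    return math.exp(-0.5 * (t - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def posterior_pmf(y, p, var_w):
    """p_{X|Y}(x|y) of (3.94) for p_X(1) = p and zero-mean Gaussian noise."""
    prior = {0: 1 - p, 1: p}
    weight = {x: gauss(y - x, 0.0, var_w) * prior[x] for x in (0, 1)}
    f_y = sum(weight.values())          # output pdf f_Y(y), as in (3.90)
    return {x: weight[x] / f_y for x in (0, 1)}

print(posterior_pmf(0.5, 0.5, 1.0))     # the two inputs are equally probable
print(posterior_pmf(0.9, 0.5, 0.25))    # an observation near 1 favors x = 1
```

For equally likely inputs an observation at the midpoint $y = 0.5$
leaves both inputs equally probable, while observations closer to 1 make
$x = 1$ more probable.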



Formula (3.94) now permits the analysis of a classical problem in
communications, the detection of a binary signal in Gaussian noise.

3.10 Binary Detection in Gaussian Noise

The derivation of the MAP detector or classifier extends immediately
to the situation of a binary input random variable and
independent Gaussian noise just treated. As in the purely discrete case, the
MAP detector $\hat{X}(y)$ of $X$ given $Y = y$ is given by

\[
\hat{X}(y) = \arg\max_x p_{X|Y}(x|y)
= \arg\max_x \frac{f_W(y - x)\, p_X(x)}{\sum_\alpha p_X(\alpha) f_W(y - \alpha)}.
\tag{3.95}
\]



Since the denominator of the conditional pmf does not depend on $x$
(only on $y$), given $y$ the denominator has no effect on the
maximization:

\[ \hat{X}(y) = \arg\max_x p_{X|Y}(x|y) = \arg\max_x f_W(y - x)\, p_X(x). \]



Assume for simplicity that X is equally likely to be 0 or 1 so that 
the rule becomes 



\[
\hat{X}(y) = \arg\max_x p_{X|Y}(x|y)
= \arg\max_x \frac{1}{\sqrt{2\pi\sigma_W^2}}\, e^{-\frac{1}{2}\frac{(x-y)^2}{\sigma_W^2}}.
\]

The constant in front of the pdf does not affect the maximization.
In addition, the exponential is a monotonically decreasing function
of $|x - y|$, so that the exponential is maximized by minimizing this
magnitude difference, i.e.,

\[ \hat{X}(y) = \arg\max_x p_{X|Y}(x|y) = \arg\min_x |x - y|, \tag{3.96} \]

which yields a final simple rule: see if $x = 0$ or $1$ is closer to $y$ as the
best guess of $x$. This choice yields the MAP detection and hence the
minimum probability of error. In our example this yields the rule

\[
\hat{X}(y) = \begin{cases} 0 & y < 0.5 \\ 1 & y > 0.5. \end{cases}
\]

Because the optimal detector chooses the $x$ that minimizes the
Euclidean distance $|x - y|$ to the observation $y$, it is called a minimum
distance detector or rule. Because the guess can be computed by
comparing the observation to a threshold (the value midway between
the two possible values of $x$), the detector is also called a threshold
detector.

Assumptions have been made to keep things fairly simple. The
reader is invited to work out what happens if the random variable X
is biased and if its alphabet is taken to be {−1, 1} instead of {0, 1}.
It is instructive to sketch the conditional pmf’s for these cases.

Having derived the optimal detector, it is reasonable to look at the
resulting minimized probability of error. This can be found using
conditional probability:

$$\begin{aligned}
P_e &= \Pr(\hat{X}(Y) \neq X) \\
&= \Pr(\hat{X}(Y) \neq 0 \mid X = 0)\,p_X(0) + \Pr(\hat{X}(Y) \neq 1 \mid X = 1)\,p_X(1) \\
&= \Pr(Y > 0.5 \mid X = 0)\,p_X(0) + \Pr(Y < 0.5 \mid X = 1)\,p_X(1) \\
&= \Pr(W + X > 0.5 \mid X = 0)\,p_X(0) + \Pr(W + X < 0.5 \mid X = 1)\,p_X(1) \\
&= \Pr(W > 0.5 \mid X = 0)\,p_X(0) + \Pr(W + 1 < 0.5 \mid X = 1)\,p_X(1) \\
&= \Pr(W > 0.5)\,p_X(0) + \Pr(W < -0.5)\,p_X(1)
\end{aligned}$$



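where we have used the independence of $W$ and $X$. These probabilities
can be stated in terms of the $\Phi$ function of (2.78) as in (2.82),
which combined with the assumption that $X$ is uniform and (2.84)
yields

$$P_e = \frac{1}{2}\left(1 - \Phi\!\left(\frac{0.5}{\sigma_W}\right) + \Phi\!\left(-\frac{0.5}{\sigma_W}\right)\right) = \Phi\!\left(-\frac{1}{2\sigma_W}\right). \tag{3.98}$$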
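
As a rough numerical check, the following Python sketch compares the closed-form error probability of the threshold detector, which for equiprobable inputs reduces to $\Pr(W > 0.5) = \Phi(-1/(2\sigma_W))$, with a simulated error rate. The function names, the noise standard deviation, the number of trials, and the seed are illustrative assumptions, not part of the text.

```python
import math
import random

def phi(x):
    """Standard Gaussian cdf, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def threshold_detector(y):
    """Minimum distance rule for the alphabet {0, 1}: guess the closer value."""
    return 1 if y > 0.5 else 0

def error_probability(sigma_w):
    """Closed-form P_e for equiprobable X in {0, 1} and N(0, sigma_w^2) noise."""
    return phi(-1.0 / (2.0 * sigma_w))

def simulate_error_rate(sigma_w, trials=100_000, seed=0):
    """Monte Carlo estimate of P_e (trials and seed are arbitrary choices)."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        x = rng.randint(0, 1)               # equiprobable binary input
        y = x + rng.gauss(0.0, sigma_w)     # observation Y = X + W
        errors += threshold_detector(y) != x
    return errors / trials

if __name__ == "__main__":
    sigma = 0.5                             # assumed noise standard deviation
    print(error_probability(sigma), simulate_error_rate(sigma))
```

The simulated rate should approach the closed form as the number of trials grows; the agreement is statistical, not exact.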



3.11 Statistical Estimation 

Discrete conditional probabilities were seen to provide a method
for guessing an unknown class from an observation: if all incorrect
choices have equal costs so that the overall optimality criterion is to
minimize the probability of error, then the optimal classification rule
is to guess that the class is $X = k$, where $p_{X|Y}(k|y) = \max_x p_{X|Y}(x|y)$,
the maximum a posteriori or MAP decision rule. There is an analogous
problem and solution in the continuous case, but the result
does not have as strong an interpretation as in the discrete case. A
more complete analogy will be derived in the next chapter.

As in the discrete case, suppose that a random variable $Y$ is observed
and the goal is to make a good guess $\hat{X}(Y)$ of another random
variable $X$ that is jointly distributed with $Y$. Unfortunately,
in the continuous case it does not make sense to measure the quality
of such a guess by the probability of its being correct because
now that probability is usually zero. For example, if $Y$ is formed by
adding a Gaussian signal $X$ to an independent Gaussian noise $W$ to
form an observation $Y = X + W$ as in the previous section, then no
rule is going to recover $X$ perfectly from $Y$. Nonetheless, intuitively
there should be reasonable ways to make such guesses in continuous
situations. Since $X$ is continuous, such guesses are referred to as
“estimation” or “prediction” of $X$ rather than as “classification” or
“detection” as used in the discrete case. In the statistical literature
the general problem is referred to as “regression”.

One approach is to mimic the discrete approach on intuitive
grounds. If the best guess in the classification problem of a random
variable $X$ given an observation $Y$ is the MAP classifier
$\hat{X}_{MAP}(y) = \operatorname{argmax}_x p_{X|Y}(x|y)$, then a natural analog
in the continuous case is the so-called MAP estimator defined by

$$\hat{X}_{MAP}(y) = \operatorname*{argmax}_x f_{X|Y}(x|y), \tag{3.99}$$

the value of $x$ maximizing the conditional pdf given $y$. The advantage
of this estimator is that it is easy to describe and provides an
immediate application of conditional pdf’s paralleling that of
classification for discrete conditional probability. The disadvantage is
that we cannot argue that this estimate is “optimal” in the sense of
optimizing some specified criterion; it is essentially an ad hoc (but
reasonable) rule. As an example of its use, consider the Gaussian
signal plus noise of the previous section. There it was found that the
pdf $f_{X|Y}(x|y)$ is Gaussian with mean $\frac{\sigma_X^2}{\sigma_X^2 + \sigma_W^2}\,y$.
Since the Gaussian density has its peak at its mean, in this case the
MAP estimate of $X$ given $Y = y$ is given by the conditional mean
$\frac{\sigma_X^2}{\sigma_X^2 + \sigma_W^2}\,y$.

Knowledge of the conditional pdf is all that is needed to define another
estimator: the maximum likelihood or ML estimate of $X$ given
$Y = y$ is defined as the value of $x$ that maximizes the conditional pdf
$f_{Y|X}(y|x)$, the pdf with the roles of input and output reversed from
that of the MAP estimator. Thus

$$\hat{X}_{ML}(y) = \operatorname*{argmax}_x f_{Y|X}(y|x). \tag{3.100}$$

Thus in the Gaussian case treated above, $\hat{X}_{ML}(y) = y$.

The main interest in the ML estimator in some applications is that
it is sometimes simpler. It also does not require any assumption on
the input probabilities. The MAP estimator depends strongly on $f_X$;
the ML estimator does not depend on it at all. In the special case
where the input pdf $f_X$ is uniform and the conditional pdf $f_{Y|X}(y|x)$
is 0 wherever $f_X(x) = 0$, then maximizing $f_{Y|X}(y|x)$ over $x$ is equivalent
to maximizing $f_{X|Y}(x|y) = f_{Y|X}(y|x) f_X(x) / f_Y(y)$ over $x$ so that
the MAP estimator and the ML estimator are the same.
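
The difference between the two estimators can be seen numerically. The sketch below assumes a zero-mean Gaussian signal and independent zero-mean Gaussian noise (so that $f_{Y|X}(y|x) = f_W(y - x)$) and finds both maximizers by a simple grid search. The variances, the observation value, and the grid are hypothetical choices made only for illustration.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Scalar Gaussian density with the given mean and variance."""
    return np.exp(-(x - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def map_and_ml_estimates(y, var_x, var_w, grid):
    """Grid search for the maximizers of f_{X|Y}(x|y) (MAP) and f_{Y|X}(y|x) (ML)."""
    likelihood = gaussian_pdf(y, grid, var_w)   # f_{Y|X}(y|x) = f_W(y - x)
    prior = gaussian_pdf(grid, 0.0, var_x)      # f_X(x), assumed zero mean
    posterior_unnormalized = likelihood * prior # f_Y(y) does not depend on x
    x_map = grid[np.argmax(posterior_unnormalized)]
    x_ml = grid[np.argmax(likelihood)]
    return x_map, x_ml

grid = np.linspace(-10.0, 10.0, 200_001)        # spacing 1e-4
x_map, x_ml = map_and_ml_estimates(y=2.0, var_x=1.0, var_w=3.0, grid=grid)
# x_map should be close to var_x / (var_x + var_w) * y, and x_ml close to y.
```

Dropping the prior from the product is all that separates the ML search from the MAP search, which is why the two coincide when the prior is flat over the region where the likelihood is nonzero.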



3.12 Characteristic Functions 

We have seen that summing two random variables produces a new 
random variable whose pmf or pdf is found by convolving the two 
pmf’s or pdf’s of the original random variables. Anyone with an 
engineering background will likely have had experience with convo- 
lution and know from experience that convolutions can be somewhat 
messy to evaluate. To make matters worse, if one wishes to sum additional
independent random variables to the existing sum, say form
$Y_N = \sum_{k=1}^{N} X_k$ from an iid collection $\{X_k\}$, then the result will
be an $N$-fold convolution, a potential nightmare in all but the simplest
of cases. As in other engineering applications such as circuit
design, convolutions can be avoided by Fourier transform methods. 
In this subsection we describe the method as an alternative approach 
for the examples to come. We begin with the discrete case. 

Historically the transforms used in probability theory have been
slightly different from those in traditional Fourier analysis. For a
discrete random variable with pmf $p_X$, define the characteristic function
$M_X$ of the random variable (or of the pmf) as

$$M_X(ju) = \sum_x p_X(x)\, e^{jux}, \tag{3.101}$$

where $u$ is usually assumed to be real. Recalling the definition (2.34)
of the expectation of a function $g$ defined on a sample space, choosing
$g(\omega) = e^{juX(\omega)}$ shows that the characteristic function can be more
simply defined as

$$M_X(ju) = E[e^{juX}]. \tag{3.102}$$

Thus characteristic functions, like probabilities, can be viewed as 
special cases of expectations. 

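This transform, which is also referred to as an exponential transform
or operational transform, bears a strong resemblance to the
discrete-parameter Fourier transform

$$\mathcal{F}_\nu(p_X) = \sum_x p_X(x)\, e^{-j2\pi\nu x} \tag{3.103}$$

and the z-transform

$$Z_z(p_X) = \sum_x p_X(x)\, z^x. \tag{3.104}$$

In particular, $M_X(ju) = \mathcal{F}_{-u/2\pi}(p_X) = Z_{e^{ju}}(p_X)$. As a result, all
of the properties of characteristic functions follow immediately from
(are equivalent to) similar properties from Fourier or z-transforms.
As with Fourier and z-transforms, the original pmf $p_X$ can be recovered
from the transform $M_X$ by suitable inversion. For example,
given a pmf $p_X(k)$; $k \in \mathcal{Z}_N$,

$$\begin{aligned}
\frac{1}{2\pi}\int_{-\pi}^{\pi} M_X(ju)\, e^{-juk}\, du
&= \frac{1}{2\pi}\int_{-\pi}^{\pi} \left(\sum_x p_X(x)\, e^{jux}\right) e^{-juk}\, du \\
&= \sum_x p_X(x)\, \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{ju(x-k)}\, du \\
&= \sum_x p_X(x)\, \delta_{k-x} = p_X(k).
\end{aligned} \tag{3.105}$$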

X 



Consider again the problem of summing two independent random
variables $X$ and $W$ with pmf’s $p_X$ and $p_W$ and characteristic functions
$M_X$ and $M_W$, respectively. If $Y = X + W$, as before we can
evaluate the characteristic function of $Y$ as

$$M_Y(ju) = \sum_y p_Y(y)\, e^{juy}$$

where from the inverse image formula

$$p_Y(y) = \sum_{x,w:\, x+w=y} p_{X,W}(x,w)$$

so that

$$\begin{aligned}
M_Y(ju) &= \sum_y \left(\sum_{x,w:\, x+w=y} p_{X,W}(x,w)\right) e^{juy} \\
&= \sum_y \left(\sum_{x,w:\, x+w=y} p_{X,W}(x,w)\, e^{juy}\right) \\
&= \sum_y \left(\sum_{x,w:\, x+w=y} p_{X,W}(x,w)\, e^{ju(x+w)}\right) \\
&= \sum_{x,w} p_{X,W}(x,w)\, e^{ju(x+w)},
\end{aligned}$$

where the last equality follows because the double sum for all $y$, $x$
and $w$ is the sum for all $x$ and $w$. This last sum factors, however, as

$$\begin{aligned}
M_Y(ju) &= \sum_{x,w} p_X(x)\, p_W(w)\, e^{jux} e^{juw} \\
&= \sum_x p_X(x)\, e^{jux} \sum_w p_W(w)\, e^{juw} \\
&= M_X(ju)\, M_W(ju),
\end{aligned} \tag{3.106}$$

which shows that the transform of the pmf of the sum of independent 
random variables is simply the product of the transforms. 
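
The product rule can be checked numerically for small pmf’s. The following sketch assumes two hypothetical pmf’s on nonnegative integers, forms the pmf of the sum by discrete convolution, and compares its characteristic function with the product of the individual characteristic functions at a few values of $u$.

```python
import numpy as np

def char_fn(pmf, u):
    """M_X(ju) = sum_x p_X(x) e^{jux} for a pmf on {0, 1, ..., len(pmf) - 1}."""
    x = np.arange(len(pmf))
    return np.sum(pmf * np.exp(1j * u * x))

# Hypothetical pmf's chosen only for illustration.
p_x = np.array([0.2, 0.5, 0.3])
p_w = np.array([0.6, 0.1, 0.1, 0.2])

# pmf of Y = X + W for independent X and W: the convolution of the two pmf's.
p_y = np.convolve(p_x, p_w)

us = np.linspace(-np.pi, np.pi, 9)
lhs = np.array([char_fn(p_y, u) for u in us])                        # M_Y(ju)
rhs = np.array([char_fn(p_x, u) * char_fn(p_w, u) for u in us])      # M_X(ju) M_W(ju)
max_gap = np.max(np.abs(lhs - rhs))
```

The two sides should agree up to floating-point rounding, since the convolution and the product describe the same distribution.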

Iterating (3.106) several times gives an extremely useful result that 
we state formally as a theorem. It can be proved by repeating the 
above argument, but we shall later see a shorter proof. 

Theorem 3.1 If $\{X_i;\ i = 1, \ldots, N\}$ are independent random variables
with characteristic functions $M_{X_i}$, then the characteristic function
of the random variable $Y = \sum_{i=1}^{N} X_i$ is

$$M_Y(ju) = \prod_{i=1}^{N} M_{X_i}(ju). \tag{3.107}$$

If the $X_i$ are independent and identically distributed with common
characteristic function $M_X$, then

$$M_Y(ju) = M_X^N(ju). \tag{3.108}$$

As a simple example, the characteristic function of a binary random
variable $X$ with parameter $p = p_X(1) = 1 - p_X(0)$ is easily
found to be

$$M_X(ju) = \sum_{k=0}^{1} e^{juk}\, p_X(k) = (1 - p) + p e^{ju}. \tag{3.109}$$

If $\{X_i;\ i = 1, \ldots, n\}$ are independent Bernoulli random variables
with identical distributions and $Y_n = \sum_{i=1}^{n} X_i$, then
$M_{Y_n}(ju) = [(1 - p) + p e^{ju}]^n$ and hence

$$\begin{aligned}
M_{Y_n}(ju) &= \sum_{k=0}^{n} p_{Y_n}(k)\, e^{juk} = \left((1 - p) + p e^{ju}\right)^n \\
&= \sum_{k=0}^{n} \left[\binom{n}{k} (1-p)^{n-k} p^k\right] e^{juk},
\end{aligned}$$

where we have invoked the binomial theorem in the last step. For the
equality to hold, however, we have from the uniqueness of transforms
that $p_{Y_n}(k)$ must be the bracketed term, that is, the binomial pmf

$$p_{Y_n}(k) = \binom{n}{k} (1 - p)^{n-k} p^k;\quad k \in \mathcal{Z}_{n+1}. \tag{3.110}$$
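
The inversion step can also be carried out numerically. Because $Y_n$ takes only the values $0, 1, \ldots, n$, sampling the characteristic function at the $n+1$ points $u_m = 2\pi m/(n+1)$ and applying a discrete Fourier transform recovers the pmf. The sketch below is one way to do this with NumPy's FFT; the function names and the parameter values are illustrative assumptions.

```python
import math
import numpy as np

def binomial_pmf_from_transform(n, p):
    """Invert M_{Y_n}(ju) = ((1-p) + p e^{ju})^n sampled at u_m = 2*pi*m/(n+1)."""
    size = n + 1
    u = 2.0 * np.pi * np.arange(size) / size
    m_y = ((1.0 - p) + p * np.exp(1j * u)) ** n
    # fft computes sum_m M[m] e^{-j 2 pi m k / size}, which equals size * p_Y(k).
    return np.real(np.fft.fft(m_y)) / size

def binomial_pmf_direct(n, p):
    """Binomial pmf written out term by term."""
    return np.array([math.comb(n, k) * (1 - p) ** (n - k) * p ** k
                     for k in range(n + 1)])

pmf_via_transform = binomial_pmf_from_transform(10, 0.3)
pmf_direct = binomial_pmf_direct(10, 0.3)
```

The two arrays should match to within rounding error, which is the numerical counterpart of the uniqueness-of-transforms argument.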

In the continuous case, as in the discrete case, convolutions can be
avoided by transforming the densities involved. The derivation is exactly
analogous to the discrete case, with integrals replacing sums in the usual way.

For a continuous random variable $X$ with pdf $f_X$, define the characteristic
function $M_X$ of the random variable (or of the pdf) as

$$M_X(ju) = \int f_X(x)\, e^{jux}\, dx. \tag{3.111}$$

As in the discrete case, this can be considered as a special case of
expectation for continuous random variables as defined in (2.34) so
that

$$M_X(ju) = E[e^{juX}]. \tag{3.112}$$

The characteristic function is related to the continuous-parameter
Fourier transform

$$\mathcal{F}_\nu(f_X) = \int f_X(x)\, e^{-j2\pi\nu x}\, dx \tag{3.113}$$

and the Laplace transform

$$\mathcal{L}_s(f_X) = \int f_X(x)\, e^{-sx}\, dx \tag{3.114}$$

by $M_X(ju) = \mathcal{F}_{-u/2\pi}(f_X) = \mathcal{L}_{-ju}(f_X)$. As a result, all of the
properties of characteristic functions of densities follow immediately from
(are equivalent to) similar properties from Fourier or Laplace transforms.
For example, given a well-behaved density $f_X(x)$; $x \in \Re$ with
characteristic function $M_X(ju)$,

$$f_X(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} M_X(ju)\, e^{-jux}\, du. \tag{3.115}$$

Consider again the problem of summing two independent random
variables $X$ and $W$ with pdf’s $f_X$ and $f_W$ and characteristic functions
$M_X$ and $M_W$, respectively. As in the discrete case it can be shown
that the sum $Y = X + W$ has characteristic function

$$M_Y(ju) = M_X(ju)\, M_W(ju). \tag{3.116}$$

Rather than mimic the proof of the discrete case, however, we post- 
pone the proof to a more general treatment of characteristic functions 
in chapter 4. 

As in the discrete case, iterating (3.116) several times yields the
following result, which now includes both discrete and continuous cases.

Theorem 3.2 If $\{X_i;\ i = 1, \ldots, N\}$ are independent random variables
with characteristic functions $M_{X_i}$, then the characteristic function
of the random variable $Y = \sum_{i=1}^{N} X_i$ is

$$M_Y(ju) = \prod_{i=1}^{N} M_{X_i}(ju). \tag{3.117}$$

If the $X_i$ are independent and identically distributed with common
characteristic function $M_X$, then

$$M_Y(ju) = M_X^N(ju). \tag{3.118}$$

As an example of characteristic functions and continuous random
variables, consider the Gaussian random variable. The evaluation
requires a bit of effort, either using the “complete the square” technique
of calculus or by looking up the result in published tables. Assume that
$X$ is a Gaussian random variable with mean $m$ and variance $\sigma^2$.
Then

$$\begin{aligned}
M_X(ju) &= \int_{-\infty}^{\infty} \frac{1}{(2\pi\sigma^2)^{1/2}}\, e^{-(x-m)^2/2\sigma^2}\, e^{jux}\, dx \\
&= \int_{-\infty}^{\infty} \frac{1}{(2\pi\sigma^2)^{1/2}}\, e^{-(x^2 - 2mx - 2\sigma^2 jux + m^2)/2\sigma^2}\, dx \\
&= \left(\int_{-\infty}^{\infty} \frac{1}{(2\pi\sigma^2)^{1/2}}\, e^{-(x - (m + ju\sigma^2))^2/2\sigma^2}\, dx\right) e^{jum - u^2\sigma^2/2} \\
&= e^{jum - u^2\sigma^2/2}.
\end{aligned} \tag{3.119}$$

Thus the characteristic function of a Gaussian random variable
with mean $m$ and variance $\sigma^2$ is

$$M_X(ju) = e^{jum - u^2\sigma^2/2}. \tag{3.120}$$

If $\{X_i;\ i = 1, \ldots, n\}$ are independent Gaussian random variables
with identical densities $\mathcal{N}(m, \sigma^2)$ and $Y_n = \sum_{i=1}^{n} X_i$, then

$$M_{Y_n}(ju) = \left[e^{jum - u^2\sigma^2/2}\right]^n = e^{ju(nm) - u^2(n\sigma^2)/2}, \tag{3.121}$$

which is the characteristic function of a Gaussian random variable
with mean $nm$ and variance $n\sigma^2$.

The following maxim should be kept in mind whenever faced with
sums of independent random variables:

When given a derived distribution problem involving the sum of
independent random variables, first find the characteristic function
of the sum by taking the product of the characteristic functions
of the individual random variables. Then find the corresponding
probability function by inverting the transform. This technique is
valid if the random variables are independent; they do not have
to be identically distributed.
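
As a sanity check on the Gaussian characteristic function, the following sketch approximates the defining integral by a Riemann sum on a fine grid and compares it with the closed form $e^{jum - u^2\sigma^2/2}$. It also checks that raising the closed form to the $n$th power gives the characteristic function of a Gaussian with mean $nm$ and variance $n\sigma^2$. The grid width, the number of points, and the parameter values are assumptions made for illustration.

```python
import numpy as np

def gaussian_cf_closed_form(u, m, var):
    """Closed-form characteristic function exp(j u m - u^2 var / 2)."""
    return np.exp(1j * u * m - u ** 2 * var / 2.0)

def gaussian_cf_numeric(u, m, var, half_width=12.0, points=200_001):
    """Riemann-sum approximation of the integral of f_X(x) e^{jux} dx."""
    sd = np.sqrt(var)
    x = np.linspace(m - half_width * sd, m + half_width * sd, points)
    dx = x[1] - x[0]
    pdf = np.exp(-(x - m) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
    return np.sum(pdf * np.exp(1j * u * x)) * dx

u, m, var, n = 1.5, 1.0, 2.0, 4          # illustrative values
numeric = gaussian_cf_numeric(u, m, var)
closed = gaussian_cf_closed_form(u, m, var)
sum_cf = closed ** n                      # characteristic function of the iid sum
```

Truncating the integral at twelve standard deviations is an arbitrary choice; the neglected tails are far below floating-point precision.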



3.13 Gaussian Random Vectors 

A random vector is said to be Gaussian if its density is Gaussian,
that is, if its distribution is described by the multidimensional
pdf explained in chapter 2. The component random variables of
a Gaussian random vector are said to be jointly Gaussian random
variables. Note that the symmetric matrix A of the k-dimensional
vector pdf has k(k + 1)/2 parameters and that the vector m has k
parameters. On the other hand, the k marginal pdf’s together have 
only 2k parameters. Again we note the impossibility of constructing 
joint pdf’s without more specification than the marginal pdf’s alone. 
However, the marginals suffice to describe the entire vector if we also 
know that the vector has independent components, e.g., the vector 
is iid. In this case the matrix A is diagonal. 

Although difficult to describe, Gaussian random vectors have sev- 
eral nice properties. One of the most important of these properties 
is that linear or affine operations on Gaussian random vectors pro- 
duce Gaussian random vectors. This result can be demonstrated 
with a modest amount of work using multidimensional characteristic 
functions, the extension of transforms from scalars to vectors. 

The multidimensional characteristic function of a distribution is
defined as follows: Given a random vector $X = (X_0, \ldots, X_{n-1})$ and
a vector parameter $u = (u_0, \ldots, u_{n-1})$, the $n$-dimensional characteristic
function $M_X(ju)$ is defined by

$$M_X(ju) = M_{X_0, \ldots, X_{n-1}}(ju_0, \ldots, ju_{n-1}) = E\left(e^{ju^t X}\right) = E\left(\exp\left(j\sum_{k=0}^{n-1} u_k X_k\right)\right). \tag{3.122}$$

It can be shown using multivariable calculus (problem 3.57) that a
Gaussian random vector with mean vector $m$ and covariance matrix
$A$ has characteristic function

$$M_X(ju) = e^{ju^t m - u^t A u/2} = \exp\left[j\sum_{k=0}^{n-1} u_k m_k - \frac{1}{2}\sum_{k=0}^{n-1}\sum_{m=0}^{n-1} u_k A(k,m)\, u_m\right]. \tag{3.123}$$



Observe that the Gaussian characteristic function has the same 
form as the Gaussian pdf — an exponential quadratic in its argu- 
ment. However, unlike the pdf, the characteristic function depends 
on the covariance matrix directly, whereas the pdf contains the in- 
verse of the covariance matrix. Thus the Gaussian characteristic 
function is in some sense simpler than the Gaussian pdf. As a fur- 
ther consequence of the direct dependence on the covariance matrix, 
it is interesting to note that, unlike the Gaussian pdf, the charac- 
teristic function is well-defined even if A is only nonnegative definite 
and not strictly positive definite. Previously we gave a definition of 
a Gaussian random vector in terms of its pdf. Now we can give an 
alternate, more general (in the sense that a strictly positive definite 
covariance matrix is not required) definition of a Gaussian random 
vector (and hence random process):

A random vector is Gaussian if and only if it has a characteristic 
function of the form of (3.123). 

This may seem strange at first thought: how can we define a
vector to be Gaussian by virtue of having a characteristic function
of a certain form while allowing that the pdf might not exist? If
the pdf does not exist, then of what is the characteristic function
the transform? The answer is that in general it is the transform of
the distribution, which may exist even if the pdf does not. We are
effectively defining a Gaussian distribution here and not a Gaussian
pdf. We shall later see that if a distribution is Gaussian but the
pdf does not exist, then it is an example of what is called a singular
distribution and that the covariance matrix is singular, that is, not
invertible.
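
The claim that linear operations preserve the Gaussian form can be illustrated with the characteristic function alone. If $Y = HX$, then $E[e^{ju^t Y}] = E[e^{j(H^t u)^t X}] = M_X(jH^t u)$, and substituting into the Gaussian form gives a characteristic function of the same form with mean $Hm$ and covariance $HAH^t$. The sketch below evaluates both sides for a hypothetical $3 \times 2$ matrix $H$, chosen so that $HAH^t$ is singular and no pdf exists for $Y$; all numerical values are assumptions made for illustration.

```python
import numpy as np

def gaussian_vector_cf(u, m, A):
    """Gaussian characteristic function exp(j u^t m - u^t A u / 2)."""
    u = np.asarray(u, dtype=float)
    return np.exp(1j * (u @ m) - (u @ A @ u) / 2.0)

# Hypothetical mean vector, covariance matrix, and linear map.
m = np.array([1.0, -2.0])
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
H = np.array([[1.0, 1.0],
              [1.0, 1.0],
              [0.0, 3.0]])          # 3 x 2, so H A H^t cannot have full rank

u = np.array([0.3, -0.7, 0.2])

# Characteristic function of Y = H X computed two ways.
cf_y_via_x = gaussian_vector_cf(H.T @ u, m, A)           # M_X(j H^t u)
cf_y_direct = gaussian_vector_cf(u, H @ m, H @ A @ H.T)  # Gaussian form for Y
rank_of_cov_y = np.linalg.matrix_rank(H @ A @ H.T)
```

The characteristic function of $Y$ is perfectly well defined here even though its covariance matrix has rank 2 in three dimensions.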
3.14 Simple Random Processes



In this section several examples of random processes defined on sim- 
ple probability spaces are given to illustrate the basic definition of 
an infinite collection of random variables defined on a single space. 
In the next section more complicated examples are considered by 
defining random variables on a probability space which is the output 
space for another random process, a setup that can be viewed as 
signal processing. 



Examples 

[3.22] Consider the binary probability space $(\Omega, \mathcal{F}, P)$ with $\Omega = \{0, 1\}$,
$\mathcal{F}$ the usual event space, and $P$ induced by the pmf $p(0) = a$
and $p(1) = 1 - a$, where $a$ is some constant, $0 < a < 1$. Define
a random process on this space as follows:

$$X(t,\omega) = \cos(\omega t) = \begin{cases} \cos(t),\ t \in \Re & \text{if } \omega = 1 \\ 1,\ t \in \Re & \text{if } \omega = 0. \end{cases}$$

Thus if a 1 occurs a cosine is sent forever, and if a 0 occurs a
constant 1 is sent forever.



This process clearly has continuous time and at first glance it might
appear to also have continuous amplitude, but only two waveforms
are possible, a cosine and a constant. Thus the alphabet at each
time contains at most two values with nonzero probability and these
possible values change with time. Hence this process is in fact a
discrete amplitude process and random vectors drawn from this source
are described by pmf’s. We can consider the alphabet of the process
to be either $\Re^{\mathcal{T}}$ or $[-1,1]^{\mathcal{T}}$, among other possibilities. Fix time at
$t = \pi/2$. Then $X(\pi/2)$ is a random variable with pmf

$$p_{X(\pi/2)}(x) = \begin{cases} a, & \text{if } x = 1 \\ 1 - a, & \text{if } x = 0. \end{cases}$$

The reader should try other instances of time. What happens at
$t = 0, 2\pi, 4\pi, \ldots, 2m\pi, \ldots$?

[3.23] Consider a probability space $(\Omega, \mathcal{F}, P)$ with $\Omega = \Re$,
$\mathcal{F} = \mathcal{B}(\Re)$, the Borel field, and probability measure $P$ induced by the
pdf

$$f(r) = \begin{cases} 1 & \text{if } r \in [0, 1] \\ 0 & \text{otherwise.} \end{cases}$$

Again define the random process $\{X(t)\}$ by $X(t,\omega) = \cos(\omega t)$; $t \in \Re$.



Again the process is continuous time, but now it has mixed alphabet
because an uncountable infinity of waveforms is possible corresponding
to all angular frequencies between 0 and 1 so that $X(t,\omega)$
is a continuous random variable except at $t = 0$. $X(0,\omega) = 1$ is a
discrete random variable. If you calculate the pdf of the random
variable $X(t)$ you see that it varies as a function of time (problem
3.33).
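
The time dependence in example [3.23] is easy to see by sampling. The sketch below draws outcomes $\omega$ from the uniform pdf on $[0, 1]$ and evaluates $X(t,\omega) = \cos(\omega t)$ at a few fixed times. The sample size, the seed, and the chosen times are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)                  # seed chosen arbitrarily
omega = rng.uniform(0.0, 1.0, size=100_000)     # outcomes from the uniform pdf

def sample_process(t, omega):
    """X(t, omega) = cos(omega * t) evaluated at one fixed time t."""
    return np.cos(omega * t)

x_at_0 = sample_process(0.0, omega)   # every outcome gives the value 1
x_at_1 = sample_process(1.0, omega)   # values spread over (cos 1, 1]
x_at_3 = sample_process(3.0, omega)   # values spread over (cos 3, 1]
```

At $t = 0$ the samples collapse to the single value 1, while at later times they occupy an interval whose lower end depends on $t$, consistent with a pdf that varies with time.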

[3.24] Consider the probability space of example [3.23], but cut
it down to the unit interval so that $([0, 1), \mathcal{B}([0, 1)), P)$ is the
probability space, where $P$ is the probability measure induced by
the pdf $f(r) = 1$; $r \in [0, 1)$. (So far this is just another model for
the same thing.) For $n = 1, 2, \ldots$, define $X_n(\omega) = b_n(\omega)$ = the $n$th digit
in the binary expansion of $\omega$, that is,

$$\omega = \sum_{n=1}^{\infty} b_n 2^{-n}$$

or equivalently $\omega = .b_1 b_2 b_3 \cdots$ in binary.

$\{X_n;\ n = 1, 2, \ldots\}$ is a one-sided discrete alphabet random process
with alphabet $\{0, 1\}$. It is important to understand that nature has
selected $\omega$ at the beginning of time, but the observer has no way of
determining $\omega$ completely without waiting until the end of time. Nature
only reveals one bit of $\omega$ per unit time, so the observer can only
get an improved estimate of $\omega$ as time goes on. This is an excellent
example of how a random process can be modeled by selecting only a
single outcome, yet the observer sees a process that evolves forever.

In this example our change in the sample space to $[0, 1)$ from $\Re$
was done for convenience. By restricting the sample space we did
not have to define the random variable outside of the unit interval
(as we would have had to do to provide a complete description).

At times it is necessary to extend the definition of a random pro- 
cess to include vector-valued functions of time so that the random 
process is a function of three arguments instead of two. The most 
important extension is to complex-valued random processes, i.e.,
vectors of length 2. We will not make such extensions frequently but we
will include an example at this time. 

[3.25] Random Rotations

Given the same probability space as in example [3.24], define a
complex-valued random process $\{X_n\}$ as follows: Let $\alpha$ be a fixed
real parameter and define

$$X_n(\omega) = e^{jn\alpha} e^{j2\pi\omega} = e^{j(n\alpha + 2\pi\omega)};\quad n = 1, 2, 3, \ldots$$

This process, called the random rotations process, is a discrete time
continuous (complex) alphabet one-sided random process. Note that
an alternative description of the same process would be to define
$\Omega$ as the unit circle in the complex plane together with its
Borel field and to define a process $Y_n(\omega) = c^n \omega$ for some fixed $c \in \Omega$;
this representation points out that successive
values of $Y_n$ are obtained by rotating the previous value through an
angle determined by $c$.

Note that the joint pdf of the complex components of $X_n$ varies
with time, $n$, as does the pdf in example [3.23] (problem 3.36).

[3.26] Again consider the probability space of example [3.24]. We
define a random process recursively on this space as follows: Define
$X_0 = \omega$ and

$$X_n(\omega) = 2X_{n-1}(\omega) \bmod 1 = \begin{cases} 2X_{n-1}(\omega) & \text{if } 0 \le X_{n-1}(\omega) < 1/2 \\ 2X_{n-1}(\omega) - 1 & \text{if } 1/2 \le X_{n-1}(\omega) < 1, \end{cases}$$

where $r \bmod 1$ is the fractional portion of $r$. In other words, if
$X_{n-1}(\omega) = x$ is in $[0, 1/2)$, then $X_n(\omega) = 2x$. If $X_{n-1}(\omega) = x$ is
in $[1/2, 1)$, then $X_n(\omega) = 2x - 1$.
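
The recursion of example [3.26] can be traced exactly using rational arithmetic. The sketch below iterates the map and also reads off binary digits of $\omega$ along the way: the $n$th binary digit of $\omega$ is 1 exactly when $X_{n-1}(\omega) \ge 1/2$, which ties this example to the binary expansion process of example [3.24]. The function names are hypothetical, and exact fractions are used only to avoid floating-point round-off.

```python
from fractions import Fraction

def doubling_map_path(omega, steps):
    """Return [X_0, ..., X_steps] with X_0 = omega and X_n = 2 X_{n-1} mod 1."""
    path = [omega]
    for _ in range(steps):
        path.append((2 * path[-1]) % 1)
    return path

def binary_digits(omega, count):
    """First `count` binary digits b_1, b_2, ... of omega in [0, 1)."""
    digits = []
    for x in doubling_map_path(omega, count - 1):
        # The leading binary digit of x is 1 exactly when x >= 1/2.
        digits.append(1 if x >= Fraction(1, 2) else 0)
    return digits

omega = Fraction(5, 8)                    # 5/8 = .101 in binary
path = doubling_map_path(omega, 3)        # X_0, X_1, X_2, X_3
bits = binary_digits(omega, 4)            # b_1, b_2, b_3, b_4
```

Each application of the map shifts the binary expansion one place to the left and discards the integer part, so the process reveals one further digit of $\omega$ per time step.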

[3.27] Given the same probability space as in example [3.26],
define $X(t,\omega) = \cos(t + 2\pi\omega)$, $t \in \Re$. The resulting random process
$\{X(t)\}$ is continuous time and continuous amplitude and is
called a random phase process since all of the possible waveforms
are shifts of one another. Note that the pdf of $X(t,\omega)$ does not
depend on time (problem 3.37).

[3.28] Take any one of the foregoing (real) processes and quantize
or clip it; that is, define a binary quantizer $q$ by

$$q(r) = \begin{cases} a & \text{if } r \ge 0 \\ b & \text{if } r < 0 \end{cases}$$

and define the process $Y(t,\omega) = q(X(t,\omega))$, all $t$. (Typically $b = -a$.)
This is a common form of signal processing: converting
a continuous alphabet random process into a discrete alphabet
random process by means of quantization.

This process is discrete alphabet and is either continuous or dis- 
crete time, depending on the original X process. In any case Y (t) 
has a binary pmf that, in general, varies with time. 

[3.29] Say we have two random variables $U$ and $V$ defined on a
common probability space $(\Omega, \mathcal{F}, P)$. Then

$$X(t) = U\cos(2\pi f_0 t + V)$$

defines a random process on the same probability space for any
fixed parameter $f_0$.

All the foregoing random processes are well defined. The pro- 
cesses inherit probabilistic descriptions from the underlying proba- 
bility space. The techniques of derived distributions can be used to 
compute probabilities involving the outputs since, for example, any 
problem involving a single sample time is simply a derived distribu- 
tion for a single random variable, and any problem involving a finite 
collection of sample times is a single random vector derived distribu- 
tion problem. Several examples are explored in the problems at the 
end of the chapter. 
3.15 Directly Given Random Processes

3.15.1 The Kolmogorov Extension Theorem 

Consistency of distributions of random vectors of various dimensions 
plays a far greater role in the theory and practice of random processes 
than simply a means for checking the correctness of a computation. 
We have thus far argued that a necessary condition for a set of ran- 
dom vector distributions to describe collections of samples taken from 
a random process is that the distributions be consistent , e.g., given 
marginals and joints we must be able to compute the marginals from 
the joints. The Kolmogorov extension theorem states that consis- 
tency is also sufficient for a family of finite-dimensional vector dis- 
tributions to describe a random process, that is, for there to exist 
a well defined random process that agrees with the given family of 
finite dimensional distributions. We state the theorem without proof 
as the proof is far beyond the assumed mathematical prerequisites 
for this course. (The interested reader is referred to [52, 7, 29].) 
Happily, however, it is often straightforward, if somewhat tedious, to 
demonstrate that the conditions of the theorem hold and hence that 
a proposed model is well-defined. 

Theorem 3.3 Kolmogorov Extension Theorem
Suppose that one is given a consistent family of finite-dimensional
distributions $P_{X_{t_0}, X_{t_1}, \ldots, X_{t_{k-1}}}$ for positive integers $k$ and all
possible sample times $t_i \in \mathcal{T}$; $i = 0, 1, \ldots, k-1$. Then there exists a
random process $\{X_t;\ t \in \mathcal{T}\}$ that is consistent with this family. In
other words, in order to completely describe a random process, it is
sufficient to describe a consistent family of finite-dimensional
distributions of its samples.



3.15.2 IID Random Processes 

The next example extends the idea of an iid vector to provide one of
the most important random process models. Although such processes
are simple in that they possess no memory among samples, they are
a fundamental building block for more complicated processes as
well as being an important example in their own right. In a sense
these are the most random of all possible random processes because
knowledge of the past does not help predict future behavior.

A discrete-time random process $\{X_n\}$ is said to be iid if all
finite-dimensional random vectors formed by sampling the process are iid;
that is, if for any $k$ and any collection of distinct sample times
$t_0, t_1, \ldots, t_{k-1}$, the random vector $(X_{t_0}, X_{t_1}, \ldots, X_{t_{k-1}})$ is iid.

This definition is equivalent to the simpler definition of the in- 
troduction to this chapter, but the more general form is adopted 
because it more closely resembles definitions to be introduced later. 
Iid random processes are often called Bernoulli processes , especially 
in the binary case. 

It can be shown with cumbersome but straightforward effort that 
the random process of [3.24] is in fact iid. In fact, for any given 
marginal distribution there exists an iid process with that marginal 
distribution. Although eminently believable, this fact requires the 
Kolmogorov extension theorem, which states that a consistent family 
of finite-dimensional distributions implies the existence of a random 
process described or specified by those distributions. The demon- 
stration of consistency for iid processes is straightforward. Readers 
are encouraged to convince themselves for the case of n-dimensional 
distributions reducing to n — 1 dimensional distributions. 



3.15.3 Gaussian Random Processes 

A random process is Gaussian if the random vectors
$(X_{t_0}, X_{t_1}, \ldots, X_{t_{k-1}})$ are Gaussian for all positive integers $k$
and all possible sample times $t_i \in \mathcal{T}$; $i = 0, 1, \ldots, k-1$.

In order to describe a Gaussian process and verify the consistency
conditions of the Kolmogorov extension theorem, one has
to provide the $A$ matrices and $m$ vectors for all random vectors
$(X_{t_0}, X_{t_1}, \ldots, X_{t_{k-1}})$. This is accomplished by providing a mean
function $m(t)$; $t \in \mathcal{T}$ and a covariance function $A(t,s)$; $t, s \in \mathcal{T}$,
which then yield all of the required mean vectors and covariance
matrices by sampling, that is, the mean vector for $(X_{t_0}, X_{t_1}, \ldots, X_{t_{k-1}})$
is $(m(t_0), m(t_1), \ldots, m(t_{k-1}))$ and the covariance matrix is
$A = \{A(t_l, t_j);\ l, j \in \mathcal{Z}_k\}$.

That this family of density functions is in fact consistent is much
more difficult to verify than was the case for iid processes, but it
requires straightforward brute force in calculus rather than any deep
mathematical ideas to do so.

The Gaussian random process in both discrete and continuous time 
is virtually ubiquitous in the analysis of random systems. This is 
both because the model is good for a wide variety of physical phe- 
nomena and because it is extremely tractable for analysis. 
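
The sampling construction can be sketched in a few lines. The mean function and covariance function below are hypothetical choices (a linear mean and a geometrically decaying covariance) used only to show how the mean vectors and covariance matrices are produced. The code also illustrates the consistency requirement in its simplest form: since the marginal of a Gaussian vector on a subset of coordinates is Gaussian with the corresponding sub-vector and sub-matrix, dropping a sample time must give the same parameters as sampling the functions directly at the remaining times.

```python
import numpy as np

def mean_function(t):
    """Hypothetical mean function m(t), chosen only for illustration."""
    return 0.5 * t

def covariance_function(t, s, var=2.0, rho=0.8):
    """Hypothetical covariance function A(t, s) = var * rho^{|t - s|}."""
    return var * rho ** abs(t - s)

def mean_and_covariance(times):
    """Sample m(t) and A(t, s) at the given times to get a mean vector and matrix."""
    m = np.array([mean_function(t) for t in times])
    A = np.array([[covariance_function(t, s) for s in times] for t in times])
    return m, A

times = [0, 1, 3, 4]
m_full, A_full = mean_and_covariance(times)

keep = [0, 2, 3]                                   # drop the sample at time 1
m_sub, A_sub = mean_and_covariance([times[i] for i in keep])

# Consistency: the parameters of the reduced vector agree with the sub-blocks.
means_agree = np.allclose(m_full[keep], m_sub)
covs_agree = np.allclose(A_full[np.ix_(keep, keep)], A_sub)
```

Because every finite-dimensional covariance matrix is obtained from the same function, consistency of the parameters is automatic; the work mentioned in the text lies in showing that the corresponding densities integrate correctly.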



3.16 Discrete Time Markov Processes 

An iid process is often referred to as a memoryless process because 
of the independence among the samples. Such a process is both 
one of the simplest random processes and one of the most random. 
It is simple because the joint pmf’s are easily found as products of 
marginals. It is “most random” because knowing the past (or future) 
outputs does not help improve the probabilities describing the cur- 
rent output. It is natural to seek straightforward means of describing 
more complicated processes with memory and to analyze the prop- 
erties of processes resulting from operations on iid processes. A gen- 
eral approach towards modeling processes with memory is to filter 
memoryless processes, i.e., to perform an operation (a form of signal 
processing) on an input process which produces an output process 
that is not iid. In this section we explore several examples of such a 
construction, all of which provide examples of the use of conditional 
distributions for describing and investigating random processes. All 
of the processes considered in this section will prove to be examples 
of Markov processes , a class of random processes possessing a specific 
form of dependence among current and past samples. 



3.16.1 A Binary Markov Process 

Suppose that $\{X_n;\ n = 0, 1, \ldots\}$ is a Bernoulli process with

$$p_{X_n}(x) = \begin{cases} p & x = 1 \\ 1 - p & x = 0, \end{cases} \tag{3.124}$$

where $p \in (0, 1)$ is a fixed parameter. Since the pmf does not depend
on $n$, the subscript is dropped and the pmf abbreviated to $p_X$. The
pmf can also be written as

$$p_X(x) = p^x (1 - p)^{1-x};\quad x = 0, 1. \tag{3.125}$$

Since the process is assumed to be iid,

$$p_{X^n}(x^n) = \prod_{i=0}^{n-1} p_X(x_i) = p^{w(x^n)} (1 - p)^{n - w(x^n)}, \tag{3.126}$$

where $w(x^n)$ is the number of nonzero $x_i$ in $x^n$, called the Hamming
weight of the binary vector $x^n$.

We consider using $\{X_n\}$ as the input to a device which produces
an output binary process $\{Y_n\}$. The device can be viewed as a signal
processor or as a linear filter. Since the process is binary, the most
natural “linear” operations are those in the binary alphabet using
modulo 2 arithmetic as defined in (3.63-3.64). Consider the new
random process $\{Y_n;\ n = 0, 1, 2, \ldots\}$ defined by

$$Y_n = \begin{cases} Y_0 & n = 0 \\ X_n \oplus Y_{n-1} & n = 1, 2, \ldots, \end{cases} \tag{3.127}$$

where $Y_0$ is a binary equiprobable random variable ($p_{Y_0}(0) =
p_{Y_0}(1) = 0.5$) assumed to be independent of all of the $X_n$. This is an
example of a linear (modulo 2) recursion or difference equation. The
process can also be defined for $n = 1, 2, \ldots$ by

$$Y_n = \begin{cases} 1 & \text{if } X_n \neq Y_{n-1} \\ 0 & \text{if } X_n = Y_{n-1}. \end{cases}$$

This process is called a binary autoregressive process.

It should be apparent that $Y_n$ has quite different properties from
$X_n$. In particular, it depends strongly on past values. If $p < 1/2$,
$Y_n$ is more likely to equal $Y_{n-1}$ than it is to differ. If $p$ is small,
for example, $Y_n$ is likely to have long runs of 0's and 1's. $\{Y_n\}$
is a random process because it has been defined as a sequence of
random variables on a common experiment, the outputs of the $\{X_n\}$
process and an independent selection of $Y_0$. Thus all of its joint pmf's
$p_{Y^n}(y^n) = \Pr(Y^n = y^n)$ should be derivable from the inverse image
formula. We proceed to solve this derived distribution problem and then to
interpret the result.
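The recursion (3.127) is easy to simulate. The following is a minimal sketch, not part of the original development; the function name `binary_autoregressive`, the seed, and the choice $p = 0.1$ are illustrative assumptions. It draws a Bernoulli input and an equiprobable initial value and applies the modulo 2 recursion.

```python
import random

def binary_autoregressive(n, p, seed=0):
    """Return (x, y) where y[0] is equiprobable and y[k] = x[k] XOR y[k-1] for k >= 1.

    x is a Bernoulli(p) input sequence; x[0] is drawn but not used by the recursion.
    The name and the seeding scheme are illustrative choices, not from the text.
    """
    rng = random.Random(seed)
    x = [1 if rng.random() < p else 0 for _ in range(n)]
    y = [rng.randrange(2)]              # Y_0: equiprobable, independent of the X_n
    for k in range(1, n):
        y.append(x[k] ^ y[k - 1])       # modulo 2 addition is XOR on bits
    return x, y

x, y = binary_autoregressive(n=40, p=0.1)
print("".join(map(str, y)))             # small p tends to give long runs of 0's and 1's
```

The output changes value exactly when the input is 1, so the number of transitions in the output equals the number of 1's among $X_1, \ldots, X_{n-1}$.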

Using the inverse image formula in the general sense, which involves
finding a probability of an event involving $Y^n$ in terms of the
probability of an event involving $X^n$ (and, in this case, the initial
value $Y_0$), yields the following sequence of steps:

$$\begin{aligned}
p_{Y^n}(y^n) &= \Pr(Y^n = y^n) \\
&= \Pr(Y_0 = y_0, Y_1 = y_1, Y_2 = y_2, \ldots, Y_{n-1} = y_{n-1}) \\
&= \Pr(Y_0 = y_0, X_1 \oplus Y_0 = y_1, X_2 \oplus Y_1 = y_2, \ldots, X_{n-1} \oplus Y_{n-2} = y_{n-1}) \\
&= \Pr(Y_0 = y_0, X_1 \oplus y_0 = y_1, X_2 \oplus y_1 = y_2, \ldots, X_{n-1} \oplus y_{n-2} = y_{n-1}) \\
&= \Pr(Y_0 = y_0, X_1 = y_1 \oplus y_0, X_2 = y_2 \oplus y_1, \ldots, X_{n-1} = y_{n-1} \oplus y_{n-2}) \\
&= p_{Y_0, X_1, X_2, \ldots, X_{n-1}}(y_0, y_1 \oplus y_0, y_2 \oplus y_1, \ldots, y_{n-1} \oplus y_{n-2}) \\
&= p_{Y_0}(y_0) \prod_{i=1}^{n-1} p_X(y_i \oplus y_{i-1}).
\end{aligned} \tag{3.128}$$

The derivation used the facts that (1) $a \oplus b = c$ if and only if $a =
b \oplus c$, (2) the independence of $Y_0, X_1, X_2, \ldots, X_{n-1}$, and (3) the
fact that the $X_n$ are iid. This formula completes the first goal, except
possibly plugging in the specific forms of $p_{Y_0}$ and $p_X$ to get

$$p_{Y^n}(y^n) = \frac{1}{2} \prod_{i=1}^{n-1} p^{y_i \oplus y_{i-1}} (1 - p)^{1 - y_i \oplus y_{i-1}}. \tag{3.129}$$

The marginal pmf's for $Y_n$ can be evaluated by summing out the
joints, e.g.,

$$p_{Y_1}(y_1) = \sum_{y_0} p_{Y_0, Y_1}(y_0, y_1) = \frac{1}{2} \sum_{y_0} p^{y_1 \oplus y_0} (1 - p)^{1 - y_1 \oplus y_0} = \frac{1}{2}; \quad y_1 = 0, 1.$$

In a similar fashion it can be shown that the marginals for $Y_n$ are all
the same:

$$p_{Y_n}(y) = \frac{1}{2}; \quad y = 0, 1; \quad n = 0, 1, 2, \ldots, \tag{3.130}$$

and hence as with $X_n$ the pmf can be abbreviated as $p_Y$, dropping
the subscript.
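For small $n$ the joint pmf can be checked by brute force. The sketch below is an illustrative check rather than part of the original text (all function names are assumptions): it enumerates every choice of $y_0$ and $x_1, \ldots, x_{n-1}$, accumulates the probability of each resulting output vector, and compares the result with the closed form (3.129) and with the uniform marginals of (3.130).

```python
from itertools import product

def joint_pmf_by_enumeration(n, p):
    """Joint pmf of (Y_0, ..., Y_{n-1}), found by summing over all (y_0, x_1, ..., x_{n-1})."""
    pmf = {}
    for y0 in (0, 1):
        for xs in product((0, 1), repeat=n - 1):
            prob = 0.5                              # Y_0 is equiprobable
            y = [y0]
            for x in xs:
                prob *= p if x == 1 else 1 - p      # Bernoulli(p) input
                y.append(x ^ y[-1])                 # recursion (3.127)
            key = tuple(y)
            pmf[key] = pmf.get(key, 0.0) + prob
    return pmf

def joint_pmf_formula(y, p):
    """Closed form (3.129) evaluated at the binary tuple y."""
    prob = 0.5
    for i in range(1, len(y)):
        d = y[i] ^ y[i - 1]
        prob *= p ** d * (1 - p) ** (1 - d)
    return prob

def marginal(pmf, index):
    """Marginal pmf of the coordinate with the given index."""
    out = {0: 0.0, 1: 0.0}
    for y, prob in pmf.items():
        out[y[index]] += prob
    return out

pmf = joint_pmf_by_enumeration(n=4, p=0.2)
worst = max(abs(prob - joint_pmf_formula(y, 0.2)) for y, prob in pmf.items())
print(worst, marginal(pmf, 3))
```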

Observe in particular that unlike the iid $\{X_n\}$ process,

$$p_{Y^n}(y^n) \ne \prod_{i=0}^{n-1} p_Y(y_i) \tag{3.131}$$

(provided $p \ne 1/2$) and hence $\{Y_n\}$ is not an iid process and the joint
pmf cannot be written as a product of the marginals. Nonetheless,
the joint pmf can be written as a product of simple terms, as has
been done in (3.129). From the definition of conditional probability
and (3.128)

$$p_{Y_l | Y_0, Y_1, \ldots, Y_{l-1}}(y_l | y_0, y_1, \ldots, y_{l-1}) = \frac{p_{Y^{l+1}}(y^{l+1})}{p_{Y^l}(y^l)} = p_X(y_l \oplus y_{l-1}) \tag{3.132}$$

and (3.128) is then recognizable as the chain rule (3.50) for the joint
pmf $p_{Y^n}(y^n)$.

Note that the conditional probability of the current output $Y_l$ given
the values for the entire past $Y_i$, $i = 0, 1, \ldots, l - 1$, depends only on
the most recent past output $Y_{l-1}$! This property can be summarized
nicely by also deriving the conditional pmf

$$p_{Y_l | Y_{l-1}}(y_l | y_{l-1}) = \frac{p_{Y_{l-1}, Y_l}(y_{l-1}, y_l)}{p_{Y_{l-1}}(y_{l-1})}, \tag{3.133}$$

which with a little effort resembling the previous derivation can be
evaluated as $p^{y_l \oplus y_{l-1}} (1 - p)^{1 - y_l \oplus y_{l-1}}$. Thus the $\{Y_n\}$ process has the
property that

$$p_{Y_l | Y_0, Y_1, \ldots, Y_{l-1}}(y_l | y_0, y_1, \ldots, y_{l-1}) = p_{Y_l | Y_{l-1}}(y_l | y_{l-1}). \tag{3.134}$$

A discrete time random process with this property is called a Markov
process or Markov chain. Such processes are among the most studied
random processes with memory.



3.16.2 The Binomial Counting Process 

We next turn to a filtering of a Bernoulli process that is linear in the
ordinary sense of real numbers. Now the input process will be
binary, but the output process will have the nonnegative integers as
an alphabet. Simply speaking, the output process will be formed by
counting the number of heads in a sequence of coin flips.

Let $\{X_n\}$ be an iid binary random process with marginal pmf $p_X(1) =
p = 1 - p_X(0)$. Define a new one-sided random process $\{Y_n;\ n =
0, 1, \ldots\}$ by

$$Y_n = \begin{cases} Y_0 = 0 & n = 0 \\ \sum_{k=1}^{n} X_k = Y_{n-1} + X_n & n = 1, 2, \ldots \end{cases} \tag{3.135}$$



For $n \ge 1$ this process can be viewed as the output of a discrete time
time-invariant linear filter with Kronecker delta response $h_k$ given by
$h_k = 1$ for $k \ge 0$ and $h_k = 0$ otherwise. From (3.135), each random
variable $Y_n$ provides a count of the number of 1's appearing in the
$X_n$ process through time $n$. Because of this counting structure we
have that either

$$Y_n = Y_{n-1} \text{ or } Y_n = Y_{n-1} + 1; \quad n = 2, 3, \ldots. \tag{3.136}$$

In general, a discrete time process that satisfies (3.136) is called a
counting process since it is nondecreasing, and when it jumps, it is
always with an increment of 1. (A continuous alphabet counting
process is similarly defined as a process with a nondecreasing output
which increases in time in steps of 1.)

To completely describe this process it suffices to have a formula
for the joint pmf's

$$p_{Y_1, \ldots, Y_n}(y_1, \ldots, y_n) = p_{Y_1}(y_1) \prod_{l=2}^{n} p_{Y_l | Y_1, \ldots, Y_{l-1}}(y_l | y_1, \ldots, y_{l-1}). \tag{3.137}$$

When we have constructed one process $\{Y_n\}$ from an existing process
$\{X_n\}$, we need not worry about consistency since we have defined the
new process on an underlying probability space (the output space of
the original process), and hence the joint distributions must be
consistent if they are correctly computed from the underlying probability
measure, namely the process distribution for the iid process.

Since $Y_n$ is formed by summing $n$ Bernoulli random variables, the
pmf for $Y_n$ follows immediately from (3.110); it is the binomial pmf
and hence the process is referred to as the binomial counting process.
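The claim that the marginal of $Y_n$ is binomial can be checked numerically by propagating the pmf through the recursion $Y_n = Y_{n-1} + X_n$. This is a sketch under the assumption that the binomial pmf referred to in (3.110) has the usual form $\binom{n}{k} p^k (1-p)^{n-k}$; the function names are illustrative.

```python
from math import comb

def counting_marginal(n, p):
    """pmf of Y_n as a list indexed by the count, built from Y_n = Y_{n-1} + X_n."""
    pmf = [1.0]                          # Y_0 = 0 with probability 1
    for _ in range(n):
        nxt = [0.0] * (len(pmf) + 1)
        for k, prob in enumerate(pmf):
            nxt[k] += prob * (1 - p)     # X_n = 0: the count is unchanged
            nxt[k + 1] += prob * p       # X_n = 1: the count increases by 1
        pmf = nxt
    return pmf

def binomial_pmf(n, p):
    """Binomial pmf with parameters n and p, as a list indexed by k = 0, ..., n."""
    return [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]

recursive = counting_marginal(6, 0.3)
closed_form = binomial_pmf(6, 0.3)
print(max(abs(a - b) for a, b in zip(recursive, closed_form)))
```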

The joint probabilities could be computed using the vector inverse
image formula as with the binary Markov source, but instead we
focus on the conditional distributions and compute them directly.
The same approach could have been used for the binary Markov
example.

The conditional pmf's are computed by describing probabilistically
the next output $Y_n$ of the process given the previous $n - 1$ outputs
$Y_1, \ldots, Y_{n-1}$. For the binomial counting process, the next output is
formed simply by adding a binary random variable to the old sum.
Thus all of the conditional probability mass is concentrated on two
values: the last value and the last value plus 1. The conditional
pmf's can therefore be expressed as

$$\begin{aligned}
p_{Y_n | Y_{n-1}, \ldots, Y_1}(y_n | y_{n-1}, \ldots, y_1)
&= \Pr(Y_n = y_n | Y_l = y_l;\ l = 1, \ldots, n - 1) \\
&= \Pr(X_n = y_n - y_{n-1} | Y_l = y_l;\ l = 1, \ldots, n - 1) \\
&= \Pr(X_n = y_n - y_{n-1} | X_1 = y_1, X_i = y_i - y_{i-1};\ i = 2, 3, \ldots, n - 1),
\end{aligned} \tag{3.138}$$

since from the definition of the $Y_n$ process the conditioning event
$\{Y_i = y_i;\ i = 1, 2, \ldots, n - 1\}$ is identical to the event $\{X_1 = y_1, X_i =
y_i - y_{i-1};\ i = 2, 3, \ldots, n - 1\}$ and, given this event, the event $Y_n = y_n$
is identical to the event $X_n = y_n - y_{n-1}$. In words, the $Y_n$ will assume
the given values if and only if the $X_n$ assume the corresponding
differences since the $Y_n$ are defined as the sum of the $X_n$. Now,
however, the probability is entirely in terms of the given $X_i$ variables,
in particular,

$$p_{Y_n | Y_{n-1}, \ldots, Y_1}(y_n | y_{n-1}, \ldots, y_1) = p_{X_n | X_{n-1}, \ldots, X_2, X_1}(y_n - y_{n-1} | y_{n-1} - y_{n-2}, \ldots, y_2 - y_1, y_1). \tag{3.139}$$

So far the development is valid for any process and has not used the
fact that the $\{X_n\}$ are iid. If the $\{X_n\}$ are iid, then the conditional
pmf's are simply the marginal pmf's since each $X_n$ is independent of
past $X_k$; $k < n$. Thus we have that

$$p_{Y_n | Y_{n-1}, \ldots, Y_1}(y_n | y_{n-1}, \ldots, y_1) = p_X(y_n - y_{n-1}), \tag{3.140}$$

and hence from the chain rule the vector pmf is (defining $y_0 = 0$)

$$p_{Y_1, \ldots, Y_n}(y_1, \ldots, y_n) = \prod_{i=1}^{n} p_X(y_i - y_{i-1}), \tag{3.141}$$

providing the desired specification.

To apply this formula to the special case of the binomial counting
process, we need only plug in the binary pmf for $p_X$ to obtain the
desired specification of the binomial counting process:

$$p_{Y_1, \ldots, Y_n}(y_1, \ldots, y_n) = \prod_{i=1}^{n} p^{(y_i - y_{i-1})} (1 - p)^{1 - (y_i - y_{i-1})},$$

where

$$y_i - y_{i-1} = 0 \text{ or } 1,\ i = 1, 2, \ldots, n; \quad y_0 = 0. \tag{3.142}$$
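The product form (3.142) translates directly into code. The sketch below (the function name `counting_joint_pmf` is an illustrative assumption) evaluates the joint pmf at a candidate path and returns zero for any path that is not a valid counting path, that is, one containing an increment other than 0 or 1.

```python
from itertools import accumulate, product

def counting_joint_pmf(ys, p):
    """Joint pmf (3.142) of (Y_1, ..., Y_n) at the tuple ys, with y_0 = 0."""
    prob, prev = 1.0, 0
    for y in ys:
        d = y - prev
        if d not in (0, 1):              # not a counting path: probability zero
            return 0.0
        prob *= p if d == 1 else 1 - p   # p_X(y_i - y_{i-1})
        prev = y
    return prob

# Summing over all valid paths of length 4 should give total probability 1.
total = sum(counting_joint_pmf(tuple(accumulate(incs)), 0.3)
            for incs in product((0, 1), repeat=4))
print(total)
```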

A similar derivation could be used to evaluate the conditional pmf
for $Y_n$ given only its immediate predecessor as:

$$\begin{aligned}
p_{Y_n | Y_{n-1}}(y_n | y_{n-1}) &= \Pr(Y_n = y_n | Y_{n-1} = y_{n-1}) \\
&= \Pr(X_n = y_n - y_{n-1} | Y_{n-1} = y_{n-1}).
\end{aligned}$$

The conditioning event, however, depends only on values of $X_k$ for
$k < n$, and $X_n$ is independent of its past; hence

$$p_{Y_n | Y_{n-1}}(y_n | y_{n-1}) = p_X(y_n - y_{n-1}). \tag{3.143}$$

The same conclusion can be reached by the longer route of using the
joint pmf for $Y_1, \ldots, Y_n$ previously computed to find the joint pmf
for $Y_n$ and $Y_{n-1}$, which in turn can be used to find the conditional
pmf. Comparison with (3.140) reveals that processes formed by summing
iid processes (such as the binomial counting process) have the
property that

$$p_{Y_n | Y_{n-1}, \ldots, Y_1}(y_n | y_{n-1}, \ldots, y_1) = p_{Y_n | Y_{n-1}}(y_n | y_{n-1}) \tag{3.144}$$

or, equivalently,

$$\Pr(Y_n = y_n | Y_i = y_i;\ i = 1, \ldots, n - 1) = \Pr(Y_n = y_n | Y_{n-1} = y_{n-1}), \tag{3.145}$$

that is, they are Markov processes. Roughly speaking, given the most
recent past sample (or the current sample), the remainder of the past
does not affect the probability of what happens next. Alternatively
stated, given the present, the future is independent of the past.



3.16.3 ★ Discrete Random Walk 

As a second example of the preceding development, consider the
random walk defined as in (3.135), i.e., by

$$Y_n = \begin{cases} 0 & n = 0 \\ \sum_{k=1}^{n} X_k & n = 1, 2, \ldots, \end{cases} \tag{3.146}$$

where the iid process used has alphabet $\{1, -1\}$ and $\Pr(X_n = -1) =
p$. This is another example of an autoregressive process since it can
be written in the form of a regression

$$Y_n = Y_{n-1} + X_n, \quad n = 1, 2, \ldots \tag{3.147}$$

One can think of $Y_n$ as modeling a drunk on a path who flips a coin
at each minute to decide whether to take one step forward or one step
backward. In this case the transform of the iid random variables is

$$M_X(ju) = (1 - p) e^{ju} + p e^{-ju},$$

and hence using the binomial theorem of algebra we have that

$$\begin{aligned}
M_{Y_n}(ju) &= \left((1 - p) e^{ju} + p e^{-ju}\right)^n \\
&= \sum_{k=0}^{n} \binom{n}{k} (1 - p)^{n-k} p^{k} e^{ju(n - 2k)} \\
&= \sum_{k = -n, -n+2, \ldots, n-2, n} \binom{n}{\frac{n-k}{2}} (1 - p)^{(n+k)/2} p^{(n-k)/2} e^{juk}.
\end{aligned}$$

Comparison of this formula with the definition of the characteristic
function reveals that the pmf for $Y_n$ is given by

$$p_{Y_n}(k) = \binom{n}{\frac{n-k}{2}} (1 - p)^{(n+k)/2} p^{(n-k)/2}, \quad k = -n, -n + 2, \ldots, n - 2, n.$$

Note that $Y_n$ must be even or odd depending on whether $n$ is even
or odd. This follows from the nature of the increments.
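The random walk pmf can be cross-checked without transforms by propagating the pmf one step at a time through (3.147). The following is a minimal sketch with illustrative function names; it uses the convention of the text that $\Pr(X_n = -1) = p$.

```python
from math import comb

def random_walk_pmf_formula(n, p):
    """Closed-form pmf of Y_n, with Pr(X = -1) = p and Pr(X = +1) = 1 - p."""
    return {k: comb(n, (n - k) // 2) * (1 - p) ** ((n + k) // 2) * p ** ((n - k) // 2)
            for k in range(-n, n + 1, 2)}       # k has the same parity as n

def random_walk_pmf_recursion(n, p):
    """pmf of Y_n built step by step from Y_n = Y_{n-1} + X_n, starting at Y_0 = 0."""
    pmf = {0: 1.0}
    for _ in range(n):
        nxt = {}
        for k, prob in pmf.items():
            nxt[k + 1] = nxt.get(k + 1, 0.0) + prob * (1 - p)   # step forward
            nxt[k - 1] = nxt.get(k - 1, 0.0) + prob * p         # step backward
        pmf = nxt
    return pmf

formula = random_walk_pmf_formula(7, 0.3)
recursion = random_walk_pmf_recursion(7, 0.3)
print(max(abs(formula[k] - recursion[k]) for k in formula))
```

Only values of $k$ with the same parity as $n$ appear as keys, which matches the closing remark about even and odd values.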



3.16.4 The Discrete Time Wiener Process

Again consider a process formed by summing an iid process as in
(3.135). This time, however, let $\{X_n\}$ be an iid process with zero-mean
Gaussian marginal pdf's and variance $\sigma^2$. Then the process
$\{Y_n\}$ defined by (3.135) is called the discrete time Wiener process.
The discrete time continuous alphabet case of summing iid random
variables is handled in virtually the same manner as the discrete
alphabet case, with conditional pdf's replacing conditional pmf's.

The marginal pdf for $Y_n$ is given immediately by (3.121) as
$\mathcal{N}(0, n\sigma^2)$.

To find the joint pdf's we evaluate the pdf chain rule of (3.61):

$$f_{Y_1, \ldots, Y_n}(y_1, \ldots, y_n) = \prod_{l=1}^{n} f_{Y_l | Y_1, \ldots, Y_{l-1}}(y_l | y_1, \ldots, y_{l-1}). \tag{3.148}$$

To find the conditional pdf $f_{Y_n | Y_1, \ldots, Y_{n-1}}(y_n | y_1, \ldots, y_{n-1})$ we compute
the conditional cdf $P(Y_n \le y_n | Y_{n-i} = y_{n-i};\ i = 1, 2, \ldots, n - 1)$.
Analogous to the discrete case, from the representation of (3.135)
and the fact that the $X_n$ are iid we have that

$$\begin{aligned}
P(Y_n \le y_n | Y_{n-i} = y_{n-i};\ i = 1, 2, \ldots, n - 1)
&= P(X_n \le y_n - y_{n-1} | Y_{n-i} = y_{n-i};\ i = 1, 2, \ldots, n - 1) \\
&= P(X_n \le y_n - y_{n-1}) = F_X(y_n - y_{n-1}),
\end{aligned} \tag{3.149}$$

and hence differentiating the conditional cdf to obtain the conditional
pdf yields

$$f_{Y_n | Y_1, \ldots, Y_{n-1}}(y_n | y_1, \ldots, y_{n-1}) = \frac{\partial}{\partial y_n} F_X(y_n - y_{n-1}) = f_X(y_n - y_{n-1}), \tag{3.150}$$

the continuous analog of (3.140). Application of the pdf chain rule
then yields the continuous analog to (3.141):

$$f_{Y_1, \ldots, Y_n}(y_1, \ldots, y_n) = \prod_{i=1}^{n} f_X(y_i - y_{i-1}), \tag{3.151}$$

where again $y_0 = 0$.



Finally suppose that $f_X$ is Gaussian with zero mean and variance
$\sigma^2$. Then this becomes

$$f_{Y^n}(y^n) = \frac{e^{-\frac{y_1^2}{2\sigma^2}}}{\sqrt{2\pi\sigma^2}} \prod_{i=2}^{n} \frac{e^{-\frac{(y_i - y_{i-1})^2}{2\sigma^2}}}{\sqrt{2\pi\sigma^2}} = (2\pi\sigma^2)^{-n/2} e^{-\frac{1}{2\sigma^2}\left(\sum_{i=2}^{n} (y_i - y_{i-1})^2 + y_1^2\right)}. \tag{3.152}$$

This proves to be a Gaussian pdf with mean vector 0 and a covariance
matrix with entries $K_Y(m, n) = \sigma^2 \min(m, n)$, $m, n = 1, 2, \ldots$.
(Readers are invited to test their matrix manipulation skills and verify
this claim.)
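The covariance claim can at least be sanity-checked numerically before attempting the matrix manipulation. The sketch below is an illustration, not the verification the text asks for: it writes $Y = LX$ with $L$ the lower triangular matrix of ones (so that $Y_m = \sum_{k \le m} X_k$), forms $\sigma^2 L L^t$ as the covariance of $Y$, and compares the entries with $\sigma^2 \min(m, n)$. The function name is an assumption.

```python
def wiener_covariance(n, sigma2):
    """Covariance matrix of (Y_1, ..., Y_n) computed as sigma2 * L * L^T.

    L is the n-by-n lower triangular matrix of ones, so row m of L sums
    the first m+1 inputs (indices here are 0-based).
    """
    L = [[1.0 if k <= m else 0.0 for k in range(n)] for m in range(n)]
    return [[sigma2 * sum(L[m][k] * L[j][k] for k in range(n)) for j in range(n)]
            for m in range(n)]

K = wiener_covariance(4, 2.0)
for row in K:
    print(row)      # entry (m, j), 1-based, should equal 2.0 * min(m, j)
```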

As in the discrete alphabet case, a similar argument implies that

$$f_{Y_n | Y_{n-1}}(y_n | y_{n-1}) = f_X(y_n - y_{n-1})$$

and hence from (3.150) that

$$f_{Y_n | Y_1, \ldots, Y_{n-1}}(y_n | y_1, \ldots, y_{n-1}) = f_{Y_n | Y_{n-1}}(y_n | y_{n-1}). \tag{3.153}$$

As in the discrete alphabet case, a process with this property is
called a Markov process. We can combine the discrete alphabet and
continuous alphabet definitions into a common definition: a discrete
time random process $\{Y_n\}$ is said to be a Markov process if the
conditional cdf's satisfy the relation

$$\Pr(Y_n \le y_n | Y_{n-i} = y_{n-i};\ i = 1, 2, \ldots) = \Pr(Y_n \le y_n | Y_{n-1} = y_{n-1}) \tag{3.154}$$

for all $y_{n-1}, y_{n-2}, \ldots$. More specifically, $\{Y_n\}$ is frequently called
a first-order Markov process because it depends on only the most
recent past value. An extended definition to $n$th order Markov processes
can be made in the obvious fashion.



3.16.5 Hidden Markov Models 

A popular random process model that has proved extremely important
in the development of modern speech recognition is formed by
adding an iid process to a Markov process, so that the underlying
Markov process is "hidden." More generally, instead of adding an
iid process one can require that the observed process is conditionally
independent given the underlying Markov process. Suppose for
example that $\{X_n\}$ is a Markov process with either discrete or
continuous alphabet and that $\{W_n\}$ is an iid process, for example an
iid Gaussian process. Then the resulting process $Y_n = X_n + W_n$ is
an example of a hidden Markov model or, in the language of early
information theory, a Markov source. A wide literature exists for
estimating the parameters of the underlying Markov source when only
the sum process $Y_n$ is actually observed. A conditionally Gaussian
hidden Markov model can be equivalently considered as viewing a
Markov process through a noisy channel with iid Gaussian noise.
Perhaps surprisingly, a hidden Markov source is not itself Markov.
It is an example of a "conditionally independent" process since if
the sequence of underlying states is known, the observed process is
independent.
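A hidden Markov model of the additive kind described above can be simulated in a few lines. The sketch below is only an illustration under stated assumptions: the hidden process is taken to be a two-state chain that flips state with probability `flip_prob` (the same structure as the binary autoregressive process of Section 3.16.1), the noise is iid Gaussian with standard deviation `noise_std`, and all names and parameter values are hypothetical.

```python
import random

def simulate_hmm(n, flip_prob, noise_std, seed=0):
    """Return (states, observations) with observations[k] = states[k] + W_k.

    states is a binary Markov chain that changes value with probability flip_prob;
    W_k is iid zero-mean Gaussian noise. Only the observations would be seen in practice.
    """
    rng = random.Random(seed)
    states = [rng.randrange(2)]                          # equiprobable initial state
    for _ in range(1, n):
        flip = 1 if rng.random() < flip_prob else 0
        states.append(states[-1] ^ flip)                 # hidden Markov chain X_n
    observations = [s + rng.gauss(0.0, noise_std) for s in states]   # Y_n = X_n + W_n
    return states, observations

states, observations = simulate_hmm(n=10, flip_prob=0.1, noise_std=0.2)
for s, y in zip(states, observations):
    print(s, round(y, 3))
```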



3.17 ★ Nonelementary Conditional Probability

Perhaps the most important form for conditional probabilities is the
basic form of $\Pr(Y \in F | X = x)$, a probability measure on a random
variable $Y$ given the event that another random variable $X$ takes on
a specific value $x$. We consider a general event $Y \in F$ and not simply
$Y = y$ since the latter is usually useless in the continuous case. In
general, either or both of $Y$ and $X$ might be random vectors.

In the elementary discrete case, such conditional probabilities are
easily constructed in terms of conditional pmf's using (3.46): conditional
probability is found by summing conditional probability mass
over the event, just as is done in the unconditional case. We proposed
an analogous approach to continuous probability, but this does
not lead to a useful general theory. For example, it assumes that the
various pdf's all exist and are well behaved. As a first step towards
a better general definition (which will reduce in practice to the
constructive pdf definition when it makes sense), we derive a variation
of (3.46). Multiply both sides of (3.46) by $p_X(x)$ and sum over an
$X$-event $G$ to obtain

$$\begin{aligned}
\sum_{x \in G} P(Y \in F | X = x) p_X(x) &= \sum_{x \in G} \sum_{y \in F} p_{Y|X}(y | x) p_X(x) \\
&= \sum_{x \in G} \sum_{y \in F} p_{X,Y}(x, y) \\
&= P(X \in G, Y \in F) \\
&= P_{X,Y}(G \times F); \quad \text{all } G.
\end{aligned} \tag{3.155}$$

This formula describes the essence of the conditional probability
by saying what it does: For any $X$ event $G$, summing the product of
the conditional probability that $Y \in F$ and the marginal probability
that $X = x$ over all $x \in G$ yields the joint probability that $X \in G$
and $Y \in F$. If our tentative definition of nonelementary conditional
probability is to be useful, it must play a similar role in the
continuous case, that is, we should be able to average over conditional
probabilities to find ordinary joint probabilities, where now averages
are integrals instead of sums. This indeed works since

$$\begin{aligned}
\int_G P(Y \in F | X = x) f_X(x)\,dx &= \int_G \left( \int_F f_{Y|X}(y | x)\,dy \right) f_X(x)\,dx \\
&= \int_G \int_F f_{X,Y}(x, y)\,dy\,dx \\
&= P(X \in G, Y \in F) \\
&= P_{X,Y}(G \times F); \quad \text{all } G.
\end{aligned} \tag{3.156}$$

Thus the tentative definition of nonelementary conditional probability
of (3.52) behaves in the manner that one would like. Using the
Stieltjes notation we can combine (3.155) and (3.156) into a single
requirement:

$$\int_G P(Y \in F | X = x)\,dF_X(x) = P(X \in G, Y \in F) = P_{X,Y}(G \times F); \quad \text{all } G, \tag{3.157}$$

which is valid in both the discrete case and in the continuous case
when one has a conditional pdf. In advanced probability, (3.157) is
taken as the definition for the general (nonelementary) conditional
probability $P(Y \in F | X = x)$; that is, the conditional probability is
defined as any function of $x$ that satisfies (3.157). This is a descriptive
definition which defines an object by its behavior when integrated,
much like the rigorous definition of a Dirac delta function is by its
behavior inside an integral. This reduces to the given constructive
definitions of (3.46) in the discrete case and (3.52) in the continuous
case with a well behaved pdf. It also leads to a useful general theory
even when the conditional pdf is not well defined.
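In the discrete case the averaging property (3.155) can be checked exactly on a small example. The sketch below uses a made-up joint pmf on $\{0, 1\}^2$ (the table `JOINT` and all function names are hypothetical, chosen only for illustration) and exact rational arithmetic, and compares both sides of (3.155) for every pair of events $G$ and $F$.

```python
from fractions import Fraction
from itertools import chain, combinations

# Hypothetical joint pmf p_{X,Y}(x, y) on {0,1} x {0,1}; any valid pmf would do.
JOINT = {(0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
         (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 4)}

def p_x(x):
    """Marginal pmf of X."""
    return sum(prob for (xx, _), prob in JOINT.items() if xx == x)

def cond_prob(F, x):
    """P(Y in F | X = x), found by summing the conditional pmf over F."""
    return sum(JOINT[(x, y)] for y in F) / p_x(x)

def averaged(G, F):
    """Left side of (3.155): sum over x in G of P(Y in F | X = x) p_X(x)."""
    return sum(cond_prob(F, x) * p_x(x) for x in G)

def joint_prob(G, F):
    """Right side of (3.155): P(X in G, Y in F)."""
    return sum(JOINT[(x, y)] for x in G for y in F)

def subsets(values):
    """All subsets of a finite collection, as tuples."""
    return chain.from_iterable(combinations(values, r) for r in range(len(values) + 1))

print(all(averaged(G, F) == joint_prob(G, F)
          for G in subsets((0, 1)) for F in subsets((0, 1))))
```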

Lastly, we observe that elementary and nonelementary conditional
probabilities are related in the natural way. Suppose that $G$ is an
event with nonzero probability so that the elementary conditional
probability $P(Y \in F | X \in G)$ is well defined. Then

$$P(Y \in F | X \in G) = \frac{P_{X,Y}(G \times F)}{P_X(G)} = \frac{1}{P_X(G)} \int_G P(Y \in F | X = x)\,dF_X(x).$$





3.18 Problems 

1. Given the probability space $(\Re, \mathcal{B}(\Re), m)$, where $m$ is the probability
measure induced by the uniform pdf $f$ on $[0, 1]$ (that is, $f(r) = 1$ for
$r \in [0, 1]$ and is 0 otherwise), find the pdf's for the following random
variables defined on this space:

(a) $X(r) = |r|^2$,

(b) $Y(r) = r^{1/2}$,

(c) $Z(r) = \ln |r|$,

(d) $V(r) = ar + b$, where $a$ and $b$ are fixed constants.

(e) Find the pmf for the random variable $W(r) = 3$ if $r > 2$ and $W(r) =
1$ otherwise.

2. Do problem 3.1 for an exponential pdf on the original sample space. 

3. Do problem 3.1(a)-(d) for a Gaussian pdf on the original sample space. 

4. A random variable $X$ has a uniform pdf on $[0, 1]$. What is the probability
density function for the volume of a cube with sides of length $X$?

5. A random variable $X$ has a cumulative distribution function $F_X(\alpha)$.
What is the cdf of the random variable $Y = aX + b$, where $a$ and $b$ are
constants?

6. Use the properties of probability measures to prove the following facts
about cdf's: If $F$ is the cdf of a random variable, then

(a) $F(-\infty) = 0$ and $F(\infty) = 1$.

(b) $F(r)$ is a monotonically nondecreasing function, that is, if $x \ge y$,
then $F(x) \ge F(y)$.

(c) $F$ is continuous from the right, that is, if $\epsilon_n$, $n = 1, 2, \ldots$ is a sequence
of positive numbers decreasing to zero, then

$$\lim_{n \to \infty} F(r + \epsilon_n) = F(r).$$

Note that continuity from the right is a result of the fact that we
defined a cdf as the probability of an event of the form $(-\infty, r]$. If
instead we had defined it as the probability of an event of the form
$(-\infty, r)$ (as is often done in Eastern Europe), then cdf's would be
continuous from the left instead of from the right. When is a cdf
continuous from the left? When is it discontinuous?

7. Say we are given an arbitrary cdf $F$ for a random variable and we would
like to simulate an experiment by generating one of these random variables
as input to the experiment. As is typical of computer simulations,
all we have available is a uniformly distributed random variable $U$; that
is, $U$ has the pdf of problem 3.1. This problem explores a means of generating
the desired random variable from $U$ (this method is occasionally used in
computer simulations). Given the cdf $F$, define the inverse cdf $F^{-1}(r)$
as the smallest value of $x \in \Re$ for which $F(x) \ge r$. We specify "smallest"
to ensure a unique definition since $F$ may have the same value for
an interval of $x$. Find the cdf of the random variable $Y$ defined by
$Y = F^{-1}(U)$.

Suppose next that $X$ is a random variable with cdf $F_X(\alpha)$. What is
the distribution of the random variable $Y = F_X(X)$? This mapping is
used on individual picture elements (pixels) in an image enhancement
technique known as "histogram equalization" to enhance contrast.

8. You are given a random variable $U$ described by a pdf that is 1 on $[0, 1]$.
Describe and make a labeled sketch of a function $g$ such that the random
variable $Y = g(U)$ has a pdf $\lambda e^{-\lambda x}$; $x \ge 0$.

9. A probability space $(\Omega, \mathcal{F}, P)$ models the outcome of rolling two fair
four-sided dice on a glass table and reading their down faces. Hence
we can take $\Omega = \{1, 2, 3, 4\}^2$, the usual event space (the power set or,
equivalently, the Borel field), and a pmf placing equal probability on all
16 points in the space. On this space we define the following random
variables: $W(\omega)$ = the down face on die #1; that is, if $\omega = (\omega_1, \omega_2)$,
where $\omega_i$ denotes the down face on die #$i$, then $W(\omega) = \omega_1$. (We could
use the sampling function notation here: $W = \Pi_1$.) Similarly, define
$V(\omega) = \omega_2$, the down face on the second die. Define also $X(\omega) = \omega_1 +
\omega_2$, the sum of the down faces, and $Y(\omega) = \omega_1 \omega_2$, the product of the
down faces. Find the pmf and cdf for the random variables $X$, $Y$, $W$, and
$V$. Find the pmf's for the random vectors $(X, Y)$ and $(W, V)$. Write a
formula for the distribution of the random vector $(W, V)$ in terms of its
pmf.

Suppose that a greedy scientist has rigged the dice using magnets to 
ensure that the two dice always yield the same value; that is, we now have 
a new pmf on Q that assigns equal values to all points where the faces 
are the same and zero to the remaining points. Repeat the calculations 
for this case. 

10. A random vector $(X, Y)$ has a pmf $p_{X,Y}(k, j) = P(X = k \text{ and } Y = j)$
of px, Y (k,j) = I if (k,j) = (0,3), (0,1) ,’(2,1), (2,2), (4,0), (4,1), (4,2), 
(4,3), or (4,3) and is 0 otherwise. 

(a) Find the pmf $p_Y(y)$ and the conditional pmf $p_{X|Y}(x | y)$. Are $X$ and
$Y$ independent?

(b) Find and sketch the pmf for the random variable $R = \min(X, Y)$.
(The sketch must be labeled.)

11. Jeff draws cards from a fair 52 card deck without replacing a card after
he draws it. Let the random variable $X$ be the number of cards he draws
until (and including) the first draw of a heart.

Evaluate the following probabilities: 

- P(X = 1) 

- P(X = 2) 

- P{X = 52) 

More generally, find the pmf $p_X(k) = P(X = k)$, $k = 1, \ldots, 52$.

12. A biased 4-sided die is rolled and the down face is a random variable $N$
described by the following pmf:



Pn(ti) 




n=l,2,3,4 

otherwise 



Given the random variable N a biased coin with bias 1S flipped and 
the random variable X is 1 or zero according to whether the coin shows 
heads or tails, i.e., the conditional pmf is 

/ i \ . U-(-l n + 1 x n 

Px\N{x\n) = (-^“) (! - ~ 2 ^~) 5 x = 0, 1. 

(a) Find the expectation $E(N)$ and variance $\sigma_N^2$ of $N$.

(b) Find the conditional pmf $p_{N|X}(n | x)$.

(c) Find the conditional expectation $E(N | X = 1)$, i.e., the expectation
with respect to the conditional pmf $p_{N|X}(n | 1)$.

(d) Find the conditional variance of N given X = 1. 

(e) Define the event F as the event that the down face of the die is 1 or 
4. Are the events F and {X = 1} independent? 

13. In a certain region of west Texas, there are an average of 10 armadillos
per square mile. Denote by $N$ the number of armadillos in an area $A$.
Assume that $N$ is described by a Poisson pmf

$$p_N(n) = \frac{e^{-\lambda A} (\lambda A)^n}{n!}, \quad n = 0, 1, 2, \ldots$$

for some constant $\lambda$. How large a circular region should be selected
to ensure that the probability of finding at least 1 armadillo is at least
0.95?

14. Suppose that $X$ is a binary random variable with outputs $\{a, b\}$ with a
pmf $p_X(a) = p$ and $p_X(b) = 1 - p$ and $Y$ is a random variable described
by the conditional pdf

$$f_{Y|X}(y | x) = \frac{e^{-(y - x)^2 / 2\sigma_W^2}}{\sqrt{2\pi\sigma_W^2}}.$$

Describe the MAP detector for $X$ given $Y$ and find an expression for the
probability of error in terms of the $Q$ function.

Suppose that $p = 0.5$, but you are free to choose $a$ and $b$ subject only to
the constraint that $(a^2 + b^2)/2 = E_b$. Which is a better choice, $a = -b$ or
$a$ nonzero with $b = 0$? What can you say about the minimum achievable
$P_e$?

15. The famous ubiquitous operating system defenestration is run on $2 \times
10^8$ computers. For each of the mutually independent computers, the
probability mass function for $X$, the number of operating system crashes
in a day, is given by

Px (k) = k = 0,1, 2, 3. 

On a day when for a given computer the operating system crashes $X = k$
times, the user has a probability of $1 - 2^{-k}$ of reinstalling the operating
system.

(a) Find the mean $EX$ and variance $\sigma_X^2$ of $X$.

(b) Find the mean and variance of $W$, the total number of operating
system crashes among all of the computers on a given day.

(c) Find the probability that a particular computer has its operating
system reinstalled on a given day.

(d) Find the conditional probability for the number of crashes for a par- 
ticular computer given that the operating system was reinstalled on 
that computer on that day. 

(e) In a given group of 10 computers, what is the probability that exactly 
three of them had their operating systems reinstalled on a particular 
day? 

16. Two random variables $X$ and $Y$ have uniform probability density
functions on $(0, 1)$ and they are independent. Find the probability density
function $f_W(w)$ for the random variable $W = (X - Y)^2$ and find the
mean of $W$, $E(W)$.

17. John and Mark are going to play a game. John will draw a number $X$
using an exponential distribution with parameter $\lambda$, that is

$$f_X(x) = \lambda e^{-\lambda x}, \quad x \ge 0.$$

At the same time, Mark will independently draw a number $Y$ using a
Poisson distribution with parameter $\lambda$, that is

$$p_Y(y) = e^{-\lambda} \frac{\lambda^y}{y!}, \quad y = 0, 1, 2, \ldots$$

If John's number is larger than Mark's, they draw again. Otherwise, the
game stops.

(a) Evaluate the probability P(X > Y). 

(b) What is the expected number of draws until the game stops? 

18. Consider the two-dimensional probability space $(\Re^2, \mathcal{B}(\Re)^2, P)$, where $P$
is the probability measure induced by the pdf $g$, which is equal to a
constant $c$ in the square $\{(x, y) : x \in [-1/2, 1/2],\ y \in [-1/2, 1/2]\}$ and
zero elsewhere.

(a) Find the constant $c$.

(b) Find $P(\{x, y : x < y\})$.

(c) Define the random variable $U : \Re^2 \to \Re$ by $U(x, y) = x + y$. Find an
expression for the cdf $F_U(u) = \Pr(U \le u)$.

(d) Define the random variable $V : \Re^2 \to \Re$ by $V(x, y) = xy$. Find the
cdf $F_V(v)$.

(e) Define the random variable $W : \Re^2 \to \Re$ by $W(x, y) = \max(x, y)$,
that is, the larger of the two coordinate values. Thus $\max(x, y) = x$
if $x \ge y$. Find the cdf $F_W(w)$.

19. Suppose that $X$ and $Y$ are two random variables described by a pdf

$$f_{X,Y}(x, y) = C e^{-x^2 - y^2 + xy}.$$

(a) Find C. 

(b) Find the marginal pdf’s fx and fy. Are X and Y independent? Are 
they identically distributed? 

(c) Define the random variable $Z = X - 2Y$. Find the joint pdf $f_{X,Z}$.

20. Let $(X, Y)$ be a random vector with distribution $P_{X,Y}$ induced by the
pdf $f_{X,Y}(x, y) = f_X(x) f_Y(y)$, where

$$f_X(x) = f_Y(x) = \lambda e^{-\lambda x}; \quad x \ge 0,$$

that is, $(X, Y)$ is described by a product pdf with exponential
components.

(a) Find the pdf for the random variable $U = X + Y$.

(b) Let the "max" function be defined as in problem 3.18 and define the
"min" function as the smaller of two values; that is, $\min(x, y) = x$
if $x \le y$. Define the random vector $(W, V)$ by $W = \min(X, Y)$ and
$V = \max(X, Y)$. Find the pdf for the random vector $(W, V)$.

21. Let $(X, Y)$ be a random vector with distribution $P_{X,Y}$ induced by a
product pdf $f_{X,Y}(x, y) = f_X(x) f_Y(y)$ with $f_X(x) = f_Y(y)$ equal to the
Gaussian pdf with $m = 0$. Consider the random vector as representing
the real and imaginary parts of a complex-valued measurement. It is
often useful to consider instead a magnitude-phase representation vector
$(R, \Theta)$, where $R$ is the magnitude $(X^2 + Y^2)^{1/2}$ and $\Theta = \tan^{-1}(Y/X)$
(use the principal value of the inverse tangent). Find the joint pdf of the
random vector $(R, \Theta)$. Find the marginal pdf's of the random variables $R$
and $\Theta$. The pdf of $R$ is called the Rayleigh pdf. Are $R$ and $\Theta$ independent?

22. A probability space $(\Omega, \mathcal{F}, P)$ is defined as follows: $\Omega$ consists of all
8-dimensional binary vectors, e.g., every member of $\Omega$ has the form $\omega =
(\omega_0, \ldots, \omega_7)$, where $\omega_i$ is 0 or 1. $\mathcal{F}$ is the power set, $P$ is described by
a pmf which assigns a probability of $1/2^8$ to each of the $2^8$ elements in
$\Omega$ (a uniform pmf).

Find the pmf's describing the following random variables:

(a) $g(\omega) = \sum_{i=0}^{7} \omega_i$, i.e., the number of 1's in the binary vector.

(b) $X(\omega) = 1$ if there are an even number of 1's in $\omega$ and 0 otherwise.

(c) $Y(\omega) = \omega_j$, i.e., the value of the $j$th coordinate of $\omega$.

(d) $Z(\omega) = \max_i(\omega_i)$.

(e) $V(\omega) = g(\omega) X(\omega)$, where $g$ and $X$ are as above.

23. Suppose that $(X_0, X_1, \ldots, X_N)$ is an iid random vector with marginal
pdf's

$$f_{X_n}(\alpha) = \begin{cases} 1 & 0 < \alpha < 1 \\ 0 & \text{otherwise.} \end{cases}$$

Define the following random variables:

- $U = X_0$

- $V = \max(X_1, X_2, X_3, X_4)$



W = 




if Xi > 2X 2 
otherwise 



- A random vector $Y = (Y_1, \ldots, Y_N)$ is defined by

$$Y_n = X_n + X_{n-1}; \quad n = 1, \ldots, N.$$

(a) Find the pdf or pmf as appropriate for $U$, $V$, and $W$.

(b) Find the cumulative distribution function (cdf) for $Y_n$.

24. Let $f$ be the uniform pdf on $[0, 1]$. Let $(X, Y)$ be a random vector
described by a joint pdf

$$f_{X,Y}(x, y) = f(y) f(x - y) \quad \text{all } x, y.$$

(a) Find the marginal densities $f_X$ and $f_Y$. Are $X$ and $Y$ independent?

(b) Find P(X > 1/2|Y < 1/2). 

25. In example [3.24], find the pmf for the random variable $X_n$ for a fixed
$n$. Find the pmf for the random vector $(X_n, X_k)$ for fixed $n$ and $k$.
Consider both the cases where $n = k$ and where $n \ne k$. Find the probability
$\Pr(X_5 = X_{12})$.

26. Let $X$ and $Y$ be two random variables with joint pmf

p XY (k, j) = C-L-; j = !,■■■ ,N- k=l,2, 
3 + 1 



J- 



(a) Find C. 

(b) Find py(j). 

(c) Find $p_{X|Y}(k | j)$. Are $X$ and $Y$ independent?

(d) Find $E[1/Y]$.

27. In example [3.27] of the random phase process, find Pr(X(t) > 1/2). 

28. Evaluate the pmf $p_{Y(t)}(y)$ for the quantized process of example [3.28]
for each possible case. (Choose $b = 0$ if the process is nonnegative and
$b = -a$ otherwise.)

29. Let $([0, 1], \mathcal{B}([0, 1]), P)$ be a probability space with pdf $f(\omega) = 1$; $\omega \in
[0, 1]$. Find a random vector $\{X_t;\ t \in \{1, 2, \ldots, n\}\}$ such that $\Pr(X_t =
1) = \Pr(X_t = 0) = 1/2$ and $\Pr(X_t = 1 \text{ and } X_{t-1} = 1) = 1/8$, for relevant
$t$.

30. Give an example of two equivalent random variables (that is, two random
variables having the same distribution) that

(a) are defined on the same space but are not equal for any $\omega \in \Omega$,

(b) are defined on different spaces and have different functional forms.

31. Let $(\Re, \mathcal{B}(\Re), m)$ be the probability space of problem 3.1.

(a) Define the random process $\{X(t);\ t \in [0, \infty)\}$ by

$$X(t, \omega) = \begin{cases} 1 & \text{if } 0 \le t \le \omega \\ 0 & \text{otherwise.} \end{cases}$$

Find $\Pr(X(t) = 1)$ as a function of $t$.

(b) Define the random process $\{X(t);\ t \in [0, \infty)\}$ by

$$X(t, \omega) = \begin{cases} t/\omega & \text{if } 0 \le t \le \omega \\ 0 & \text{otherwise.} \end{cases}$$

Find $\Pr(X(t) > x)$ as a function of $t$ for $x \in (0, 1)$.

32. Two continuous random variables $X$ and $Y$ are described by the pdf

$$f_{X,Y}(x,y) = \begin{cases} c & \text{if } |x| + |y| < r \\ 0 & \text{otherwise,} \end{cases}$$

where $r$ is a fixed real constant and $c$ is a constant. In other words, the
pdf is uniform on a square whose side has length $\sqrt{2}\,r$.

(a) Evaluate $c$ in terms of $r$.

(b) Find $f_X(x)$.

(c) Are $X$ and $Y$ independent random variables? (Prove your answer.)

(d) Define the random variable $Z = |X| + |Y|$. Find the pdf $f_Z(z)$.

33. Find the pdf of X(t) in example [3.23] as a function of time. Find the 
joint cdf of the vector (X(1),X(2)). 

34. Richard III wishes to trade his kingdom for a horse. He knows that the
probability that there are $k$ horses within $r$ feet of him is

$$C \frac{H^k r^{2k} e^{-H r^2}}{k!}; \quad k = 0, 1, 2, \ldots,$$

where $H > 0$ is a fixed parameter.

(a) Let $R$ denote a random variable giving the distance from Richard to
the nearest horse. What is the probability density function $f_R(\alpha)$
for $R$? ($C$ should be evaluated as part of this question.)

(b) Rumors of the imminent arrival of Henry Tudor have led Richard to
lower his standards and consider alternative means of transportation.
Suppose that the probability density function $f_S(\beta)$ for the distance
$S$ to the nearest mule is the same as $f_R$ except that the parameter $H$
is replaced by a parameter $M$. Assume that $R$ and $S$ are independent
random variables. Find an expression for the cumulative distribution
function (cdf) for $W$, the distance to the nearest quadruped (i.e.,
horse or mule).

Hint : If you did not complete or do not trust your answer to part 
(b), then find the answer in terms of the cdf's for R and S. 

35. Suppose that a random vector $X = (X_0, \ldots, X_{k-1})$ is iid with marginal
pmf

$$p_{X_i}(l) = p_X(l) = \begin{cases} p & \text{if } l = 1 \\ 1 - p & \text{if } l = 0 \end{cases}$$

for all $i$.

(a) Find the pmf of the random variable $Y = \prod_{i=0}^{k-1} X_i$.

(b) Find the pmf of the random variable $W = X_0 + X_{k-1}$.

(c) Find the pmf of the random vector $(Y, W)$.

36. Find the joint cdf of the complex components of $X_n(\omega)$ in example [3.25]
as a function of time.

37. Find the pdf of X(t) in example [3.27]. 

38. A certain communication system outputs a discrete time series $\{X_n\}$
where $X_n$ has pmf $p_X(1) = p_X(-1) = 1/2$. Transmission noise in the
form of a random process $\{Y_n\}$ is added to $X_n$ to form a random process
$\{Z_n = X_n + Y_n\}$. $Y_n$ has a Gaussian distribution with $m = 0$ and $\sigma = 1$.

(a) Find the pdf of $Z_n$.

(b) A receiver forms a random process $\{R_n = \mathrm{sgn}(Z_n)\}$ where sgn is the
sign function $\mathrm{sgn}(x) = 1$ if $x > 0$, $\mathrm{sgn}(x) = -1$ if $x < 0$. $R_n$ is
output from the receiver as the receiver’s estimate of what was
transmitted. Find the pmf of $R_n$ and the probability of detection (i.e.,
$\Pr(R_n = X_n)$).

(c) Is this detector optimal?

39. If $X$ is a Gaussian random variable, find the marginal pdf $f_{Y(t)}$ for the
random process $Y(t)$ defined by

$$Y(t) = X \cos(2\pi f_0 t); \quad t \in \Re,$$

where $f_0$ is a known constant frequency.

40. Let $X$ and $Z$ be the random variables of problems 3.1 through 3.3. For
each assumption on the original density find the cdf for the random
vector $(X, Z)$, $F_{X,Z}(x,z)$. Does the appropriate derivative exist? Is it a
valid pdf?

41. Let $N$ be a random variable giving the number of molecules of hydrogen
in a spherical region of radius $r$ and volume $V = 4\pi r^3/3$. Assume that
$N$ is described by a Poisson pmf

$$p_N(n) = \frac{e^{-\rho V} (\rho V)^n}{n!}, \quad n = 0, 1, 2, \ldots$$

where $\rho$ can be viewed as a limiting density of molecules in space. Say we
choose an arbitrary point in deep space as the center of our coordinate
system. Define a random variable $X$ as the distance from the origin
of our coordinate center to the nearest molecule. Find the pdf of the
random variable $X$, $f_X(x)$.

42. Let $V$ be a random variable with a uniform pdf on $[0, a]$. Let $W$ be a
random variable, independent of $V$, with an exponential pdf with parameter
$\lambda$, that is,

$$f_W(w) = \lambda e^{-\lambda w}; \quad w \in [0, \infty).$$

Let $p(t)$ be the pulse with value 1 when $0 < t < 1$ and 0 otherwise. Define
the random process $\{X(t); t \in [0,\infty)\}$ by

$$X(t) = V p(t - W).$$

(This is a model of a square pulse that occurs randomly in time with
a random amplitude.) Find for a fixed time $t > 1$ the cdf $F_{X(t)}(\alpha) =
\Pr(X(t) \le \alpha)$. You must specify the values of the cdf for all possible real
values $\alpha$. Show that there exists a pmf $p$ with a corresponding cdf $F_1$, a
pdf $f$ with a corresponding cdf $F_2$, and a number $\beta_t \in (0,1)$ such that

$$F_{X(t)}(\alpha) = \beta_t F_1(\alpha) + (1 - \beta_t) F_2(\alpha).$$

Give expressions for $p$, $f$, and $\beta_t$.

43. Prove the following facts about characteristic functions:

(a) $|M_X(ju)| \le 1$

(b) $M_X(0) = 1$

(c) $|M_X(ju)| \le M_X(0) = 1$

(d) If a random variable $X$ has a characteristic function $M_X(ju)$, if $c$ is a
fixed constant, and if a random variable $Y$ is defined by $Y = X + c$,
then

$$M_Y(ju) = e^{juc} M_X(ju).$$

44. Suppose that $X$ is a random variable described by an exponential pdf

$$f_X(\alpha) = \lambda e^{-\lambda \alpha}; \quad \alpha \ge 0.$$

($\lambda > 0$.) Define a function $q$ which maps nonnegative real numbers into
integers by $q(x)$ = the largest integer less than or equal to $x$. In other
words

$$q(x) = k \text{ if } k \le x < k + 1, \quad k = 0, 1, \ldots.$$

(This function is often denoted by $q(x) = \lfloor x \rfloor$.) The function $q$ is a form
of quantizer: it rounds its input downward to the nearest integer below
the input. Define the following two random variables: the quantizer
output

$$Y = q(X)$$

and the quantizer error

$$\epsilon = X - q(X).$$

Note: By construction $\epsilon$ can only take on values in $[0, 1)$.

(a) Find the pmf $p_Y(k)$ for $Y$.

(b) Derive the probability density function for $\epsilon$. (You may find the
“divide and conquer” formula useful here, e.g., $P(G) = \sum_i P(G \cap F_i)$,
where $\{F_i\}$ is a partition.)

45. Suppose that $(X_1, \ldots, X_N)$ is a random vector described by a product
pdf with uniform marginal pdf’s

$$f_{X_n}(\alpha) = \begin{cases} 1 & |\alpha| < \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases}$$



Define the following random variables: 

-U = X$ 

-V = min(Xi,X 2 ) 

— W = n if n is the smallest integer for which X n >1/4 and W = 0 if 
there is no such n. 

(a) Find pdf’s or pmf’s for $U$, $V$, and $W$.

(b) What is the joint pdf 

46. The joint probability density function of $X$ and $Y$ is

$$f_{X,Y}(\alpha, \beta) = C, \quad |\alpha| < 1, \; 0 < \beta < 1.$$

Define a new random variable

$$U = \frac{Y}{X^2}.$$

($U$ is taken to be 0 if $X = 0$.)

(a) Find the constant $C$ and the marginal probability density functions
$f_X(\alpha)$ and $f_Y(\beta)$.

(b) Find the probability density function $f_U(\gamma)$ for $U$.

(c) Suppose that $U$ is quantized into $q(U)$ by defining

$$q(U) = i \text{ for } d_{i-1} \le U < d_i; \quad i = 1, 2, 3,$$

where the interval $[d_0, d_3)$ equals the range of possible values of $U$.
Find the quantization levels $d_i$, $i = 0, 1, 2, 3$ such that $q(U)$ has a
uniform probability mass function.

(d) Find the expectations E(X S U) and $E(q(U))$.

47. Let $(X, Y)$ be a random vector described by a product pdf $f_{XY}(x,y) =
f_X(x) f_Y(y)$. Let $F_X$ and $F_Y$ denote the corresponding marginal cdf’s.

(a) Prove

$$\int_{-\infty}^{\infty} F_Y(x) f_X(x)\,dx = 1 - \int_{-\infty}^{\infty} f_Y(x) F_X(x)\,dx.$$

(b) Assume, in addition, that $X$ and $Y$ are identically distributed, i.e.,
have the same pdf. Based on the result of (a) calculate the probability
$P(X > Y)$. (Hint: You should be able to derive or check your
answer based on symmetry.)

48. You have 2 coins and a spinning pointer U. The coins are fair and 
unbiased, and the pointer U has a uniform distribution over [0, 1). You 
flip both coins and spin the pointer. A random variable X is defined as 
follows: 

If the first coin is “heads,” then

$$X = \begin{cases} 1 & \text{if the 2nd coin is “heads”} \\ 0 & \text{otherwise.} \end{cases}$$

If the first coin is “tails,” then $X = U + 2$.

Define another random variable:

$$Y = \begin{cases} 2U & \text{if the 1st coin is “heads”} \\ 2U + 1 & \text{otherwise.} \end{cases}$$

(a) Find $F_X(x)$.

(b) Find $\Pr(\frac{1}{2} < X < 2)$.

(c) Sketch the pdf of $Y$ and label important values.

(d) Design an optimal detection rule to estimate $U$ if you are given only
$Y$. What is the probability of error?

(e) State how to, or explain why it is not possible to:

i. Generate a binary random variable $Z$, $p_Z(1) = p$, given $U$?

ii. Generate a continuous, uniformly distributed random variable given
$Z$?

49. The random vector $W = (W_0, W_1, W_2)$ is described by the pdf
$f_W(x,y,z) = C|z|$, for $x^2 + y^2 < 1$, $|z| < 1$.

(a) Find $C$.

(b) Determine whether the following variables are independent and
justify your position:

i. $W_0$ and $W_1$

ii. $W_0$ and $W_2$

iii. $W_1$ and $W_2$

iv. $W_0$ and $W_1$ and $W_2$

(c) Find $\Pr(W_2 > \frac{1}{2})$.

(d) Find $F_{W_0,W_2}(0,0)$.

(e) Find the cdf of the vector $W$.

(f) Let $V = \prod_{i=0}^{2} W_i$. Find $\Pr(V > 0)$.

(g) Find the pdf of M, where M = min(IF r 1 2 + W 2 , ).

50. Suppose that $X$ and $Y$ are random variables and that the joint pmf is

$$p_{X,Y}(k,j) = c\, 2^{-k} 2^{-(j-k)}; \quad k = 0, 1, 2, \ldots; \; j = k, k + 1, \ldots.$$

(a) Find $c$.

(b) Find the pmf’s $p_X(k)$ and $p_Y(j)$.

(c) Find the conditional pmf’s $p_{X|Y}(k|j)$ and $p_{Y|X}(j|k)$.

(d) Find the probability that $Y > 2X$.

51. Suppose that $X = (X_0, X_1, \ldots, X_{k-1})$ is a random vector ($k$ is some large
number) with joint pdf

$$f_X(x) = \begin{cases} 1 & \text{if } 0 < x_i < 1; \; i = 0, \ldots, k - 1 \\ 0 & \text{else.} \end{cases}$$



Define the random variables V = X$ + Xio and W = max(Xo, Xio). 
Define the random vector Y : 



Y n = 2 n X n ; n = 0, . . . ,k — 1. 



(a) Find the joint pdf 

(b) Find the probabilities $\Pr(W < 1/2)$, $\Pr(V < 1/2)$, and $\Pr(W < 1/2$
and $V < 1/2)$.

(c) Are W and V independent? 

(d) Find the (joint) pdf for Y. 

52. The random process described in example [3.26] is an example of a class
of processes that is currently somewhat of a fad in scientific circles: it
is a chaotic process. (See, e.g., Chaos by James Gleick (1987).) Suppose as in
example [3.26] that $X_0(\omega) = \omega$ is chosen at random according to a uniform
distribution on $[0, 1)$, that is, the pdf is

$$f_{X_0}(\alpha) = \begin{cases} 1 & \text{if } \alpha \in [0, 1) \\ 0 & \text{else.} \end{cases}$$

As in the example, the remainder of the process is defined recursively by

$$X_n(\omega) = 2 X_{n-1}(\omega) \bmod 1, \quad n = 1, 2, \ldots.$$

Note that if the initial value $X_0$ is known, the remainder of the process
is also known.

Find a nonrecursive expression for $X_n(\omega)$, that is, write $X_n(\omega)$ directly
as a function of $\omega$, e.g., $X_n(\omega) = g(\omega) \bmod 1$.

Find the pdf $f_{X_1}(\alpha)$ and $f_{X_n}(\alpha)$.

Hint: after you have found $f_{X_1}$, try induction.

53. Another random process which resembles that of the previous problem
but which is not chaotic is to define $X_0$ in the same way, but define $X_n$
by

$$X_n(\omega) = (X_{n-1}(\omega) + X_0(\omega)) \bmod 1.$$

Here $X_1$ is equivalent to that of the previous problem, but the subsequent
$X_n$ are different. As in the previous problem, find a direct formula for
$X_n$ in terms of $\omega$ (e.g., $X_n(\omega) = h(\omega) \bmod 1$) and find the pdf $f_{X_n}(\alpha)$.

54. The Mongol general Subudai is expecting reinforcements from Chenggis
Khan before attacking King Bela of Hungary. The probability mass
function describing the number $N$ of tumens (units of 10,000 men) that
he will receive is

$$p_N(k) = c\,p^k; \quad k = 0, 1, \ldots.$$

If he receives $N = k$ tumens, then his probability of losing the battle
will be $2^{-k}$. This can be described by defining the random variable $W$
which will be 1 if the battle is won, 0 if the battle is lost, and defining
the conditional probability mass function

$$p_{W|N}(m|k) = \Pr(W = m | N = k) = \begin{cases} 2^{-k} & m = 0 \\ 1 - 2^{-k} & m = 1. \end{cases}$$

(a) Find $c$.

(b) Find the (unconditional) pmf $p_W(m)$, that is, what is the probability
that Subudai will win or lose?

(c) Suppose that Subudai is informed that definitely $N < 10$. What is
the new (conditional) pmf for $N$? (That is, find $\Pr(N = k | N < 10)$.)

55. Suppose that $\{X_n; n = 0, 1, 2, \ldots\}$ is a binary Bernoulli process, that is,
an iid process with marginal pmf’s

$$p_{X_n}(k) = \begin{cases} p & \text{if } k = 1 \\ 1 - p & \text{if } k = 0 \end{cases}$$

for all $n$. Suppose that $\{W_n; n = 0, 1, \ldots\}$ is another binary Bernoulli
process with parameter $\epsilon$, that is,

$$p_{W_n}(k) = \begin{cases} \epsilon & \text{if } k = 1 \\ 1 - \epsilon & \text{if } k = 0. \end{cases}$$

We assume that the two random processes are completely independent of
each other (that is, any collection of samples of $X_n$ is independent of
any collection of $W_n$). We form a new random process $\{Y_n; n = 0, 1, \ldots\}$
by defining

$$Y_n = X_n \oplus W_n,$$

where the $\oplus$ operation denotes mod 2 addition. This setup can be
thought of as taking an input digital signal $X_n$ and sending it across
a binary channel to a receiver. The binary channel can cause an error
between the input $X_n$ and output $Y_n$ with probability $\epsilon$. Such a
communication channel is called an additive noise channel because the output
is the input plus an independent noise process (where “plus” here means
mod 2).

(a) Find the output marginal pmf $p_{Y_n}(k)$.

(b) Is $\{Y_n\}$ Bernoulli? That is, is it an iid process?

(c) Find the conditional pmf $p_{Y_n|X_n}(j|k)$.

(d) Find the conditional pmf $p_{X_n|Y_n}(k|j)$.

(e) Find an expression for the probability of error $\Pr(Y_n \neq X_n)$.

(f) Suppose that the receiver is allowed to think about what the best
guess for $X_n$ is given it receives a value $Y_n$. In other words, if you
are told that $Y_n = j$, you can form an estimate or guess of the input
$X_n$ by some function of $j$, say $\hat{X}(j)$. Given this estimate your new
probability of error is given by

$$P_e = \Pr(\hat{X}(Y_n) \neq X_n).$$

What decision rule $\hat{X}(j)$ yields the smallest possible $P_e$? What is
the resulting $P_e$?

56. Suppose that we have a pair of random variables $(X, Y)$ with a mixed
discrete and continuous distribution. $Y$ is a binary $\{0, 1\}$ random variable
described by a pmf $p_Y(1) = 0.5$. Conditioned on $Y = y$, $X$ is continuous
with a Gaussian distribution with variance $\sigma^2$ and mean $y$, that is,

$$f_{X|Y}(x|y) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - y)^2}{2\sigma^2}}; \quad y = 0, 1.$$

This can be thought of as the result of communicating a binary symbol
(a “bit”) over a noisy channel, which adds 0 mean variance $\sigma^2$ Gaussian
noise to the bit. In other words, $X = Y + W$, where $W$ is a Gaussian
random variable, independent of $Y$. What is the optimum (minimum
error probability) decision for $Y$ given the observation $X$? Write an
expression for the resulting error probability.

57. Find the multidimensional Gaussian characteristic function of equation
(3.123) by completing the square in the exponent of the defining
multidimensional integral.

4.1 Averages

In engineering practice we are often interested in the average behav- 
ior of measurements on random processes. The goal of this chapter is 
to link the two distinct types of averages that are used — long-term 
time averages taken by calculations on an actual physical realization 
of a random process and averages calculated theoretically by prob- 
abilistic averages at some given instant of time, averages that are 
called expectations. As we shall see, both computations often (but 
by no means always) give the same answer. Such results are called 
laws of large numbers or ergodic theorems. 

At first glance from a conceptual point of view, it seems unlikely
that long-term time averages and instantaneous probabilistic
averages would be the same. If we take a long-term time average of a
particular realization of the random process, say $\{X(t,\omega_0); t \in T\}$,
we are averaging for a particular $\omega$, an $\omega$ which we cannot know
or choose; we do not use probability in any way and we are ignoring
what happens with other values of $\omega$. Here the averages are
computed by summing the sequence or integrating the waveform over $t$
while $\omega_0$ stays fixed. If, on the other hand, we take an instantaneous
probabilistic average, say at the time $t_0$, we are taking a probabilistic
average and summing or integrating over $\omega$ for the random variable
$X(t_0,\omega)$. Thus we have two averages, one along the time axis with
$\omega$ fixed, the other along the $\omega$ axis with time fixed. It seems that
there should be no reason for the answers to agree. Taking a more
practical point of view, however, it seems that the time and
probabilistic averages must be the same in many situations. For example,
suppose that you measure the percentage of time that a particular
noise voltage exceeds 10 volts. If you make the measurement over
a sufficiently long period of time, the result should be a reasonably
good estimate of the probability that the noise voltage exceeds 10
volts at any given instant of time, which is a probabilistic average value.

To proceed further, for simplicity we concentrate on a discrete al- 
phabet discrete time random process. Other cases are considered 
by converting appropriate sums into integrals. Let {X n } be an ar- 
bitrary discrete alphabet discrete time process. Since the process 
is random, we cannot predict accurately its instantaneous or short- 
term behavior — we can only make probabilistic statements. Based 
on experience with coins, dice, and roulette wheels, however, one ex- 
pects that the long-term average behavior can be characterized with 
more accuracy. For example, if one flips a fair coin, short sequences 
of flips are unpredictable. However, if one flips long enough, one 
would expect to have an average of about 50% of the flips result 
in heads. This is a time average of an instantaneous function of a 
random process — a type of counting function that we will consider 
extensively. There are many functions that we can average, e.g., the
value itself, the power, etc. We will proceed by defining one particular
average, the sample average value of a random process, which is
formulated as

$$S_n = n^{-1} \sum_{i=0}^{n-1} X_i; \quad n = 1, 2, 3, \ldots$$

We will investigate the behavior of $S_n$ for large $n$, i.e., for a
long-term time average. Thus, for example, if the random process $\{X_n\}$
is the coin-flipping model, the binary process with alphabet $\{0,1\}$,
then $S_n$ is the number of 1’s divided by the total number of flips,
that is, the fraction of flips that produced a 1. As noted before, $S_n$
should be close to 50% for large $n$ if the coin is fair.
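
As a rough numerical illustration of this claim, the following sketch simulates a fair coin and tracks the sample average $S_n$ as $n$ grows. The function name `sample_averages`, the seed, and the number of flips are choices made here for illustration and are not part of the text; a pseudorandom generator stands in for the coin.

```python
import random


def sample_averages(flips):
    """Return [S_1, ..., S_n], where S_n is the mean of the first n flips."""
    averages = []
    running_sum = 0
    for n, x in enumerate(flips, start=1):
        running_sum += x
        averages.append(running_sum / n)
    return averages


# Assumption: a seeded pseudorandom generator models the fair coin.
rng = random.Random(0)
flips = [rng.randint(0, 1) for _ in range(100_000)]
S = sample_averages(flips)

# Print S_n at a few values of n to see it settle near 1/2.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, S[n - 1])
```

Short prefixes of the sequence can give sample averages far from 1/2, while the long-run values typically cluster near it, which is the behavior the laws of large numbers make precise.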

Note that, as in example [3.7], for each $n$, $S_n$ is a random variable
that is defined on the same probability space as the random process
$\{X_n\}$. This is made explicit by writing the $\omega$ dependence:

$$S_n(\omega) = \frac{1}{n} \sum_{k=0}^{n-1} X_k(\omega).$$

In more direct analogy to example [3.7], we can consider the $\{X_n\}$
as coordinate functions on a sequence space with probability measure $m$,
where $m$ is the distribution of the process, in which case $S_n$ is defined
directly on the sequence space. The form of definition is simply a
matter of semantics or convenience. Observe, however, that in any
case $\{S_n; n = 1, 2, \ldots\}$ is itself a random process since it is an indexed
family of random variables defined on a probability space.

For the discrete alphabet random process that we are considering,
we can rewrite the sum in another form by grouping together all
equal terms:

$$S_n(\omega) = \sum_{a \in A} a\, r_a^{(n)}(\omega) \qquad (4.1)$$

where $A$ is the range space of the discrete alphabet random variable
$X_n$ and $r_a^{(n)}(\omega) = n^{-1}$[number of occurrences of the letter
$a$ in $\{X_i(\omega), i = 0, 1, 2, \ldots, n - 1\}$]. The random variable $r_a^{(n)}$ is
called the $n$th-order relative frequency of the symbol $a$. For the
binary coin-flipping example we have considered, $A = \{0,1\}$, and
$S_n(\omega) = r_1^{(n)}(\omega)$, the average number of heads in the first $n$ flips. In
other words, for the binary coin-flipping example, the sample average
and the relative frequency of heads are the same quantity. More
generally, the reader should note that $r_a^{(n)}$ can always be written as
the sample average of the indicator function for $a$, $1_a(x)$:

$$r_a^{(n)} = n^{-1} \sum_{i=0}^{n-1} 1_a(X_i),$$

where

$$1_a(x) = \begin{cases} 1 & \text{if } x = a \\ 0 & \text{otherwise.} \end{cases}$$

Note that $1_{\{a\}}$ is a more precise, but more clumsy, notation for the
indicator function of the singleton set $\{a\}$. We shall use the shorter
form here.
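
The regrouping in (4.1) can be checked on a short sequence. The sketch below uses a made-up realization over the alphabet {0, 1, 2}; the helper names are illustrative only. It computes each relative frequency as a count divided by $n$, confirms that the weighted sum over the alphabet reproduces the sample average, and confirms that a relative frequency equals the sample average of an indicator function.

```python
from collections import Counter
from fractions import Fraction


def relative_frequencies(xs):
    """Map each letter a to r_a = (number of occurrences of a) / n."""
    n = len(xs)
    return {a: Fraction(count, n) for a, count in Counter(xs).items()}


def sample_average(xs):
    """S_n = (1/n) * sum of the samples, kept exact with Fraction."""
    return Fraction(sum(xs), len(xs))


# Hypothetical realization of length n = 8, chosen only for illustration.
xs = [0, 2, 1, 2, 2, 0, 1, 2]
r = relative_frequencies(xs)

# Right-hand side of (4.1): sum over the alphabet of a * r_a.
grouped = sum(a * r_a for a, r_a in r.items())

# r_2 written as the sample average of the indicator of the letter 2.
indicator_average = Fraction(sum(1 for x in xs if x == 2), len(xs))

print(grouped, sample_average(xs), r[2], indicator_average)
```

Exact rational arithmetic is used so that the two sides of (4.1) can be compared with equality rather than a floating-point tolerance.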

Let us now assume that all of the marginal pmf’s of the given
process are the same, say $p_X(x)$, $x \in A$. Based on intuition and
gambling experience, one might suspect that as $n$ goes to infinity,
the relative frequency of a symbol $a$ should go to its probability of
occurrence, $p_X(a)$. To continue the example of binary coin flipping,
the relative frequency of heads in $n$ tosses of a fair coin should tend
to $\frac{1}{2}$ as $n \to \infty$. If these statements are true, that is, if in some sense,

$$r_a^{(n)} \xrightarrow[n \to \infty]{} p_X(a), \qquad (4.2)$$

then it follows that in a similar sense

$$S_n \xrightarrow[n \to \infty]{} \sum_{a \in A} a\, p_X(a), \qquad (4.3)$$

the same expression as (4.1) with the relative frequency replaced by
the pmf. The formula on the right is an example of an expectation of
a random variable, a weighted average with respect to a probability
measure. The formula should be recognized as a special case of the
definition of expectation of (2.34), where the pmf is $p_X$ and $g(x) = x$,
the identity function. The previous plausibility argument motivates
studying such weighted averages because they will characterize the
limiting behavior of time averages in the same way that probabilities
characterize the limiting behavior of relative frequencies.

Limiting statements of the form of (4.2) and (4.3) are called laws 
of large numbers or ergodic theorems. They relate long-run sample 
averages or time average behavior to probabilistic calculations made 
at any given instant of time. It is obvious that such laws or theorems 
do not always hold. If the coin we are flipping wears in a known 
fashion with time so that the probability of a head changes, then one 
could hardly expect that the relative frequency of heads would equal 
the probability of heads at time zero. 

In order to make precise statements and to develop conditions
under which the laws or theorems do hold, we first need to develop the
properties of the quantity on the right-hand side of (4.2) and (4.3). In
particular, we cannot at this point make any sense out of a statement
like “$\lim_{n \to \infty} S_n = \sum_{a \in A} a\, p_X(a)$,” since we have no definition for such
a limit of random variables or functions of random variables. It is
obvious, however, that the usual definition of a limit used in calculus
will not do, because $S_n$ is a random variable, albeit a random variable
whose “randomness” decreases in some sense with increasing $n$. Thus
the limit must be defined in some fashion that involves probability.
Such limits are deferred to a later section. We begin by looking at
the definitions and calculus of expectations.



4.2 Expectation 

Given a discrete alphabet random variable $X$ specified by a pmf $p_X$,
define the expected value, probabilistic average, or mean of $X$ by

$$E(X) = \sum_{x \in A} x\, p_X(x). \qquad (4.4)$$

The expectation is also denoted by $EX$ or $E[X]$ or by an overbar,
as $\bar{X}$. The expectation is also sometimes called an ensemble
average to denote averaging across the ensemble of sequences that is
generated for different values of $\omega$ at a given instant of time.

The astute reader might note that we have really provided two
definitions of the expectation of $X$. The definition of (4.4) has already
been noted to be a special case of (2.34) with pmf $p_X$ and function
$g(x) = x$. Alternatively, we could use (2.34) in a more fundamental
form and consider $g(\omega) = X(\omega)$ to be a function defined on an
underlying probability space described by a pmf $p$ or a pdf $f$, in which case
(2.34) or (2.57) provides a different formula for finding the expectation
in terms of the original probability function:

$$E(X) = \sum_{\omega} X(\omega) p(\omega) \qquad (4.5)$$

if the original space is discrete, or

$$E(X) = \int X(r) f(r)\,dr \qquad (4.6)$$

if it is continuous. Are these two versions consistent? The answer
is yes, as will be proved soon by the fundamental theorem of
expectation. The equivalence of these forms is essentially a change of
variables formula.

The mean of a random variable is a weighted average of the possible
values of the random variable with the pmf used as a weighting.
Before continuing, observe that we can define an analogous quantity
for a continuous random variable possessing a pdf: If the random
variable $X$ is described by a pdf $f_X$, then we define the expectation
of $X$ by

$$EX = \int x f_X(x)\,dx, \qquad (4.7)$$

where we have replaced the sum by an integral. Analogous to the
discrete case, this formula is a special case of (2.57) with pdf $f = f_X$
and $g$ being the identity function. We can also use (2.57) to express
the expectation in terms of an underlying pdf, say $f$, as

$$EX = \int X(r) f(r)\,dr.$$

The equivalence of these two formulas will be considered when the
fundamental theorem of expectation is treated.

While the integral does not have the intuitive motivation involving 
a relative frequency converging to a pmf that the earlier sum did, we 
shall see that it plays the analogous role in the laws of large numbers. 
Roughly speaking, this is because continuous random variables can 
be approximated by discrete random variables arbitrarily closely by 
very fine quantization. Through this procedure, the integrals with 
pdf’s are approximated by sums with pmf’s and the discrete alphabet 
results imply the continuous alphabet results by taking appropriate 
limits. Because of the direct analogy, we shall develop the properties 
of expectations for continuous random variables along with those for 
discrete alphabet random variables. Note in passing that, analogous
to using the Stieltjes integral as a unified notation for sums and
integrals when computing probabilities, the same thing can be done
for expectations. If $F_X$ is the cdf of a random variable $X$, define

$$EX = \int x\,dF_X(x) = \begin{cases} \sum_x x\, p_X(x) & \text{if } X \text{ discrete} \\ \int x f_X(x)\,dx & \text{if } X \text{ has a pdf.} \end{cases}$$

In a similar manner, we can define the expectation of a mixture
random variable having both continuous and discrete parts in a manner
analogous to (3.36).

Examples 

The following examples provide some typical expectation computa- 
tions. 

[4.1] As a slight generalization of the fair coin flip, consider the
more general binary pmf with parameter $p$; that is, $p_X(1) = p$
and $p_X(0) = 1 - p$. In this case

$$EX = \sum_{x=0}^{1} x\, p_X(x) = 0(1 - p) + 1p = p.$$

It is interesting to note that in this example, as is generally true
for discrete random variables, $EX$ is not necessarily in the
alphabet of the random variable, i.e., $EX \neq 0$ or 1 unless $p = 0$ or
1.

[4.2] A more complicated discrete example is a geometric random
variable. In this case

$$EX = \sum_{k=1}^{\infty} k\, p_X(k) = \sum_{k=1}^{\infty} k p (1 - p)^{k-1},$$

a sum evaluated in (2.48) as $1/p$.

[4.3] As an example of a continuous random variable, assume that
$X$ is a uniform random variable on $[0,1]$, that is, its density is
one on $[0,1]$. Here

$$EX = \int_0^1 x f_X(x)\,dx = \int_0^1 x\,dx = \frac{1}{2},$$

an integral evaluated in (2.67).

[4.4] If $X$ is an exponential random variable with parameter $\lambda$,
then from (2.71)

$$EX = \int_0^{\infty} r \lambda e^{-\lambda r}\,dr = \frac{1}{\lambda}. \qquad (4.8)$$
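
The four example expectations can be checked numerically. The sketch below is only an illustration under stated assumptions: the parameter values $p = 0.3$ and $\lambda = 2$ are arbitrary, the geometric sum is truncated at a finite number of terms, and the integrals are approximated by a simple midpoint rule (with the exponential integral cut off at an upper limit where the tail is negligible). None of these helper names appear in the text.

```python
import math


def binary_mean(p):
    """EX for the binary pmf: 0 * (1 - p) + 1 * p."""
    return 0 * (1 - p) + 1 * p


def geometric_mean_partial(p, terms=2_000):
    """Truncated version of sum_{k>=1} k * p * (1 - p)**(k - 1)."""
    return sum(k * p * (1 - p) ** (k - 1) for k in range(1, terms + 1))


def midpoint_integral(h, a, b, steps=200_000):
    """Midpoint-rule approximation of the integral of h over [a, b]."""
    width = (b - a) / steps
    return sum(h(a + (i + 0.5) * width) for i in range(steps)) * width


p, lam = 0.3, 2.0  # arbitrary illustrative parameter values

uniform_mean = midpoint_integral(lambda x: x * 1.0, 0.0, 1.0)
# Assumption: truncating at r = 40 loses a negligible tail for lam = 2.
exponential_mean = midpoint_integral(
    lambda r: r * lam * math.exp(-lam * r), 0.0, 40.0
)

print(binary_mean(p), geometric_mean_partial(p), uniform_mean, exponential_mean)
```

The printed values can be compared with the closed forms $p$, $1/p$, $1/2$, and $1/\lambda$ from examples [4.1] through [4.4].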

In some cases expectations can be found virtually by inspection.
For example, if $X$ has an even pdf $f_X$, that is, if $f_X(-x) = f_X(x)$
for all $x \in \Re$, then if the integral exists, $EX = 0$, since $x f_X(x)$
is an odd function and hence has a zero integral. The assumption
that the integral exists is necessary because $x f_X(x)$ need not be
integrable even when $f_X$ is an even pdf. For example, suppose that
we have a pdf $f_X(x) = c/x^2$ for all $|x| \ge 1$, where $c$ is a
normalization constant. Then it is not true that $EX$ is zero, even
though the pdf is even, because the Riemann integral

$$\int_{x:\, |x| \ge 1} \frac{c}{x}\,dx$$

does not exist. (The puzzled reader should review the definition of
improper integrals. Their existence requires that the limit

$$\lim_{T \to \infty} \lim_{S \to \infty} \int_{-T}^{S} x f_X(x)\,dx$$

exists regardless of how $T$ and $S$ tend to infinity; in particular, the
existence of the limit with the constraint $T = S$ is not sufficient for
the existence of the integral. These limits do not exist for the given
example because $1/x$ is not integrable on $[1,\infty)$.) Nonetheless, it is
convenient to set $EX$ to 0 in this example because of the obvious
intuitive interpretation.
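
To see concretely why the symmetric limit is not enough, the following sketch evaluates the truncated integral of $x f_X(x)$ over $[-T, S]$ for the pdf $f_X(x) = c/x^2$, $|x| \ge 1$. It assumes $c = 1/2$ (the value that makes the pdf integrate to one) and uses the closed form $c(\ln S - \ln T)$, obtained by integrating $c/x$ over $[1, S]$ and over $[-T, -1]$; the function name is hypothetical.

```python
import math

# Normalization: 2 * (integral of C / x**2 over [1, inf)) = 2 * C = 1.
C = 0.5


def truncated_mean(T, S):
    """Integral of x * f_X(x) over [-T, S] for T, S >= 1 (closed form)."""
    return C * (math.log(S) - math.log(T))


# Three ways of letting the limits grow: S = T, S = 2T, and S = T**2.
for T in (10.0, 1e3, 1e6):
    print(T, truncated_mean(T, T), truncated_mean(T, 2 * T), truncated_mean(T, T * T))
```

With $S = T$ the truncated integral is always 0, with $S = 2T$ it stays at $c \ln 2$, and with $S = T^2$ it grows without bound, so the value depends on how the limits are taken and the integral does not exist.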

Sometimes the pdf is an even function about some nonzero value,
that is, $f_X(m + x) = f_X(m - x)$, where $m$ is some constant. In this
case, it is easily seen that if the expectation exists, then $EX = m$,
as the reader can quickly verify by a change of variable in the integral
defining the expectation. The most important example of this is the
Gaussian pdf, which is even about the constant $m$.

The same conclusions also obviously hold for an even pmf.

In addition to the expectation of a given random variable, we will
often be interested in the expectations of other random variables
formed as functions of the given one. In the beginning of the chapter
we introduced the relative frequency function, $r_a^{(n)}$, which counts the
relative number of occurrences of the value $a$ in a sequence of $n$ terms.
We are interested in its expected value and in the expected value of
the indicator function that appears in the expression for $r_a^{(n)}$. More
generally, given a random variable $X$ and a function $g: \Re \to \Re$, we
might wish to find the expectation of the random variable $Y = g(X)$.
If $X$ corresponds to a voltage measurement and $g$ is a simple squaring
operation, $g(X) = X^2$, then $g(X)$ provides the instantaneous energy
across a unit resistor. Its expected value, then, represents the
probabilistic average energy. More generally than the square of a random
variable, the moments of a random variable $X$ are defined by $E[X^k]$
for $k = 1, 2, \ldots$. The mean is the first moment, the mean square is the
second moment, and so on. If the random variable is complex-valued,
then often the absolute moments $E[|X|^k]$ are of interest. Moments
are often useful as general parameters of a distribution, providing
information on its shape without requiring the complete pdf or pmf.
Some distributions are completely characterized by a few moments
(e.g., the Gaussian). It is often useful to consider moments of a
“centralized” random variable formed by removing its mean. The
$k$th centralized moment is defined by $E[(X - E(X))^k]$. The $k$th
centralized absolute moment is defined by $E[|X - E(X)|^k]$.

Of particular interest is the second centralized moment or variance
$\sigma^2 = E[(X - E(X))^2]$. Other functions that are of interest are
indicator functions of a set, $1_F(x) = 1$ if $x \in F$ and 0 otherwise, so
that $1_F(X)$ is a binary random variable indicating whether or not
the value of $X$ lies in $F$, and complex exponentials $e^{juX}$.

Expectations of functions of random variables were defined in this 
chapter in terms of the derived distribution for the new random vari- 
able. In chapter 3, however, they were defined in terms of the original 
pmf or pdf in the underlying probability space, a formula not requir- 
ing that the new distribution be derived. We next show that the 
two formulas are consistent. First consider finding the expectation 
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of Y by using derived distribution techniques to find the probability 
function for Y . Then use the definition of expectation to evaluate 
EY . Specifically, if X is discrete, the pmf for Y is found as before as 

py(v) = Y p A x )i y G Ay - 

x: g(x)=y 

EY is then found as 

EY = ^2yp Y {y) • 

Although it is straightforward to find the probability function for
$Y$, it can be a nuisance if it is being found only as a step in the
evaluation of the expectation EY = Eg(X). A second and easier 
method of finding EY is normally used. Looking at the formula 
for EX , it seems intuitively obvious that E(g(X)) should result if 
x is replaced by g(x). This can be proved by the following simple 
procedure. The expectation of Y is found directly from the pmf for 
X by starting with the pmf for Y, then substituting for its expression 
in terms of the pmf of X and reordering the summation: 



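\begin{align*}
EY &= \sum_{A_Y} y\, p_Y(y) = \sum_{A_Y} y \Biggl( \sum_{x:\, g(x) = y} p_X(x) \Biggr) \\
&= \sum_{A_Y} \Biggl( \sum_{x:\, g(x) = y} g(x)\, p_X(x) \Biggr) \\
&= \sum_{A_X} g(x)\, p_X(x) .
\end{align*}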



This little bit of manipulation is given the fancy name of the fun- 
damental theorem of expectation. It is a very useful formula in 
that it allows the computation of expectations of functions of ran- 
dom variables without the necessity of performing the (usually more 
difficult) derived distribution operations. 
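
The two routes to $EY$ can be compared on a small example. The sketch below assumes a pmf stored as a dictionary; the particular pmf, the function $g(x) = x^2$, and the helper names are hypothetical and chosen only to make the regrouping of the sum visible.

```python
from collections import defaultdict
from fractions import Fraction


def derived_pmf(pmf_x, g):
    """pmf of Y = g(X): p_Y(y) is the sum of p_X(x) over all x with g(x) = y."""
    pmf_y = defaultdict(Fraction)
    for x, p in pmf_x.items():
        pmf_y[g(x)] += p
    return dict(pmf_y)


def expectation(pmf):
    """E[Y] from the pmf of Y itself."""
    return sum(y * p for y, p in pmf.items())


def expectation_of_function(pmf_x, g):
    """E[g(X)] directly from the pmf of X, with no derived distribution."""
    return sum(g(x) * p for x, p in pmf_x.items())


# Hypothetical example: X uniform on {-2, -1, 0, 1, 2}, Y = X^2.
pmf_x = {x: Fraction(1, 5) for x in (-2, -1, 0, 1, 2)}
g = lambda x: x * x

pmf_y = derived_pmf(pmf_x, g)
ey_derived = expectation(pmf_y)                # sum over A_Y of y p_Y(y)
ey_direct = expectation_of_function(pmf_x, g)  # sum over A_X of g(x) p_X(x)

print(pmf_y, ey_derived, ey_direct)
```

Both computations add up the same terms; the first merely groups them by the value of $g(x)$ before summing.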

A similar proof holds for the case of a discrete random variable 
defined on a continuous probability space described by a pdf. The 
proof is left as an exercise (problem 4.4). 

A similar change of variables argument with integrals in place of 
sums yields the analogous pdf result for continuous random vari- 
ables. As is customary, however, we have only provided the proof 
for the simple discrete case. For the details of the continuous case,
we refer the reader to books on integration or analysis. The reader 
should be aware that such integral results will have additional tech- 
nical assumptions (almost always satisfied) required to guarantee the 
existence of the various integrals. We summarize the results below. 



Theorem 4.1 The Fundamental Theorem of Expectation. 

Let a random variable $X$ be described by a cdf $F_X$, which is in
turn described by either a pmf $p_X$ or a pdf $f_X$. Given any
measurable function $g : \Re \to \Re$, the resulting random variable
$Y = g(X)$ has expectation

\begin{align*}
EY = E(g(X)) &= \int y \, dF_{g(X)}(y) \\
&= \int g(x) \, dF_X(x) \\
&= \begin{cases}
\sum_x g(x)\, p_X(x) \\
\quad \text{or} \\
\int g(x) f_X(x)\, dx .
\end{cases}
\end{align*}



The qualification “measurable” is needed in the theorem to guar- 
antee the existence of the expectation. Measurability is satisfied by 
almost any function that you can think of and, for all practical pur- 
poses, the requirement can be neglected. 



Examples 

As a simple example of the use of this formula, consider a random
variable $X$ with a uniform pdf on $[-\frac{1}{2}, \frac{1}{2}]$. Define the
random variable $Y = X^2$, that is, $g(r) = r^2$. We can use the derived
distribution formula (3.40) to write

\[ f_Y(y) = y^{-1/2} f_X(y^{1/2}) ; \quad y > 0 , \]

and hence

\[ f_Y(y) = y^{-1/2} ; \quad y \in (0, \tfrac{1}{4}] , \]

where we have used the fact that $f_X(y^{1/2})$ is 1 only if the nonnegative
argument is less than $\frac{1}{2}$ or $y \le \frac{1}{4}$. We can then find $EY$ as

\[ EY = \int y f_Y(y)\, dy = \int_0^{1/4} y^{1/2}\, dy = \frac{(1/4)^{3/2}}{3/2} = \frac{1}{12} . \]

Alternatively, we can use the theorem to write

\[ EY = E(X^2) = \int_{-1/2}^{1/2} x^2 \, dx = \frac{1}{12} . \]

Note that the result is the same for each method. However, the 
second calculation is much simpler, especially if one considers the 
work that was required in chapter 3 in deriving the density formula 
for the square of a random variable. 
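
Both calculations can be checked numerically. The following sketch assumes $X$ is uniform on $[-\frac{1}{2}, \frac{1}{2}]$ and uses a simple midpoint rule; the helper `midpoint_integral` and the number of subintervals are hypothetical choices, not part of the text.

```python
def midpoint_integral(f, a, b, n=100_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


# Method 1: integrate y * f_Y(y) with the derived density f_Y(y) = y**-0.5
# on (0, 1/4].  Midpoints are strictly positive, so y**-0.5 is well defined.
ey_derived = midpoint_integral(lambda y: y * y ** -0.5, 0.0, 0.25)

# Method 2: integrate g(x) * f_X(x) = x**2 * 1 over [-1/2, 1/2].
ey_direct = midpoint_integral(lambda x: x ** 2, -0.5, 0.5)

print(ey_derived, ey_direct, 1 / 12)
```

The two approximations agree with $1/12$ to within the accuracy of the quadrature, while the second requires no knowledge of $f_Y$.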

[4.5] A second example generalizes an observation of chapter 2 
and shows that expectations can be used to express probabilities 
(and hence that probabilities can be considered as special cases 
of expectation). Recall that the indicator function of an event F 
is defined by 



\[ 1_F(x) = \begin{cases} 1 & \text{if } x \in F \\ 0 & \text{otherwise.} \end{cases} \]



The probability of the event F can be written in the following 
form which is convenient in certain computations: 

\[ E 1_F(X) = \int 1_F(x) \, dF_X(x) = \int_F dF_X(x) = P_X(F) , \tag{4.9} \]

where we have used the universal Stieltjes integral representation 
of (3.32) to save writing out both sums of pmf’s and integrals 
of pdf’s (the reader who is unconvinced by (4.9) should write 
out the specific pmf and pdf forms). Observe also that finding 
probability by taking expectations of indicator functions is like 
finding a relative frequency by taking a sample average of an 
indicator function. 
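
The parallel between (4.9) and a relative frequency can be seen in a small experiment. This is only a sketch under stated assumptions: the fair-die pmf, the event $F$ = {even outcomes}, the seed, and the sample size are all hypothetical.

```python
import random
from fractions import Fraction


def indicator(event):
    """Return the indicator function 1_F of the set `event`."""
    return lambda x: 1 if x in event else 0


# Hypothetical setup: fair die, F = even outcomes.
die = {x: Fraction(1, 6) for x in range(1, 7)}
one_F = indicator({2, 4, 6})

# Expectation of the indicator computed from the pmf: equals P_X(F).
prob_exact = sum(one_F(x) * p for x, p in die.items())

# Sample average of the indicator over simulated rolls: a relative frequency.
rng = random.Random(0)
samples = [rng.randint(1, 6) for _ in range(100_000)]
rel_freq = sum(one_F(x) for x in samples) / len(samples)

print(prob_exact, rel_freq)
```

The expectation is an exact probability, whereas the sample average is a random quantity that is typically close to it for a large number of trials.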



It is obvious from the fundamental theorem of expectation that
the expected value of any function of a random variable can be
calculated from its probability distribution. The preceding example
demonstrates that the converse is also true: The probability distri- 
bution can be calculated from a knowledge of the expectation of a 
large enough set of functions of the random variable. The example 
provides the result for the set of all indicator functions. The choice 
is not unique, as shown by the following example: 

[4.6] Let $g(x)$ be the complex function $e^{jux}$ where $u$ is an arbitrary
constant. For a cdf $F_X$, define

\[ E(g(X)) = E(e^{juX}) = \int e^{jux} \, dF_X(x) . \]

This expectation is immediately recognizable as the characteristic
function of the random variable (or its distribution), providing a
shorthand definition

\[ M_X(ju) = E[e^{juX}] . \]



In addition to its use in deriving distributions for sums of inde- 
pendent random variables, the characteristic function can be used 
to compute moments of a random variable (as the Fourier transform 
can be used to find moments of a signal). For example, consider 
the discrete case and take a derivative of the characteristic function 
$M_X(ju)$ with respect to $u$:

\[ \frac{d}{du} M_X(ju) = \frac{d}{du} \sum_x p_X(x) e^{jux} = \sum_x p_X(x) (jx) e^{jux} \]

and evaluate the derivative at $u = 0$ to find that

\[ M_X'(0) = \frac{d}{du} M_X(ju) \Big|_{u=0} = jEX . \]

Thus the mean of a random variable can be found by differentiating
the characteristic function and setting the argument to 0 as

\[ EX = \frac{M_X'(0)}{j} . \tag{4.10} \]
Repeated differentiation can be used to show more generally that
the $k$th moment can be found as

\[ E[X^k] = j^{-k} M_X^{(k)}(0) = j^{-k} \frac{d^k}{du^k} M_X(ju) \Big|_{u=0} . \tag{4.11} \]



If one needs several moments of a given random variable, it is usually 
easier to do one integration to find the characteristic function and 
then several differentiations than it is to do the several integrations 
necessary to find the moments directly. Note that if we make the
substitution $w = ju$ and differentiate with respect to $w$ instead of
$u$,

\[ \frac{d^k}{dw^k} M_X(w) \Big|_{w=0} = E(X^k) . \]

Because of this property, characteristic functions with $ju = w$ are
called moment-generating functions. From the defining sum or
integral for characteristic functions in example [4.6], the
moment-generating function may not exist for all $w = v + ju$, even when it
exists for all $w = ju$ with $u$ real. This is a variation on the idea
that a Laplace transform might not exist for all complex frequencies
$s = \sigma + j\omega$ even when it exists for all $s = j\omega$ with $\omega$ real, that
is, when the Fourier transform exists.

Example [4.6] illustrates an obvious extension of the fundamental 
theorem of expectation. In [4.6] the complex function is actually a 
vector function of length 2. Thus it is seen that the theorem is valid 
for vector functions, $\mathbf{g}(x)$, as well as for scalar functions, $g(x)$. The
expectation of a vector is simply the vector of expected values of the 
components. 

As a simple example, recall from (3.109) that the characteristic
function of a binary random variable $X$ with parameter $p = p_X(1) =
1 - p_X(0)$ is

\[ M_X(ju) = (1 - p) + p e^{ju} . \tag{4.12} \]

It is easily seen that

\[ \frac{M_X'(0)}{j} = p = E[X] , \qquad -M_X^{(2)}(0) = p = E[X^2] . \]
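
The moment formulas (4.10) and (4.11) can be checked numerically for the binary characteristic function (4.12). The sketch below replaces exact differentiation with central finite differences, which is an approximation introduced here for illustration; the step size `h`, the value `p = 0.3`, and the function names are hypothetical.

```python
import cmath


def char_fn_binary(u, p):
    """Characteristic function (4.12): M_X(ju) = (1 - p) + p e^{ju}."""
    return (1 - p) + p * cmath.exp(1j * u)


def moments_from_char_fn(M, h=1e-4):
    """Estimate E[X] and E[X^2] from derivatives of M at u = 0."""
    d1 = (M(h) - M(-h)) / (2 * h)               # approximates M'(0)
    d2 = (M(h) - 2 * M(0.0) + M(-h)) / h ** 2   # approximates M''(0)
    first = d1 / 1j                             # (4.10): EX = M'(0) / j
    second = d2 / (1j) ** 2                     # (4.11) with k = 2
    return first.real, second.real


p = 0.3
m1, m2 = moments_from_char_fn(lambda u: char_fn_binary(u, p))

print(m1, m2)  # both should be close to p
```

For a binary random variable $X^2 = X$, so both moments equal $p$, in agreement with the result stated above.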
As another example, consider $\mathcal{N}(m, \sigma^2)$, the Gaussian pdf with
mean $m$ and variance $\sigma^2$. Differentiating easily yields

\[ \frac{M_X'(0)}{j} = m = E[X] , \qquad -M_X^{(2)}(0) = \sigma^2 + m^2 = E[X^2] . \]

The relationship between the characteristic function of a distribu- 
tion and the moments of a distribution becomes particularly striking 
when the characteristic function is sufficiently nice near the origin 
to possess a Taylor series expansion. The Taylor series of a function 
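$f(u)$ about the point $u = 0$ has the form

\begin{align*}
f(u) &= \sum_{k=0}^{\infty} u^k \frac{f^{(k)}(0)}{k!} \\
&= f(0) + u f^{(1)}(0) + u^2 \frac{f^{(2)}(0)}{2} + \text{terms in } u^k ;\ k \ge 3 , \tag{4.13}
\end{align*}

where the derivatives

\[ f^{(k)}(0) = \frac{d^k}{du^k} f(u) \Big|_{u=0} \]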

are assumed to exist, that is, the function is assumed to be analytic at 
the origin. Combining the Taylor series expansion with the moment- 
generating property (4.11) yields 



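\begin{align*}
M_X(ju) &= \sum_{k=0}^{\infty} u^k \frac{M_X^{(k)}(0)}{k!} = \sum_{k=0}^{\infty} (ju)^k \frac{E(X^k)}{k!} \\
&= 1 + ju E(X) - \frac{u^2}{2} E(X^2) + o(u^2) , \tag{4.14}
\end{align*}

where $o(u^2)$ contains higher order terms that go to zero as $u \to 0$
faster than $u^2$.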
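
The meaning of the $o(u^2)$ term in (4.14) can be illustrated numerically. The sketch below assumes the binary characteristic function of (4.12) with a hypothetical parameter $p = 0.3$ and checks that the error of the three-term approximation, divided by $u^2$, shrinks as $u$ decreases.

```python
import cmath


def char_fn_binary(u, p):
    """Binary characteristic function: M_X(ju) = (1 - p) + p e^{ju}."""
    return (1 - p) + p * cmath.exp(1j * u)


def second_order_approx(u, mean, second_moment):
    """First three terms of (4.14): 1 + ju E(X) - (u^2 / 2) E(X^2)."""
    return 1 + 1j * u * mean - 0.5 * u ** 2 * second_moment


p = 0.3  # for a binary variable, E(X) = E(X^2) = p
ratios = []
for u in (1.0, 0.1, 0.01):
    err = abs(char_fn_binary(u, p) - second_order_approx(u, p, p))
    ratios.append(err / u ** 2)  # should tend to 0 if err is o(u^2)

print(ratios)
```

Here the leading error term is proportional to $u^3$, so the ratio falls roughly in proportion to $u$.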

This result has an interesting implication: knowing all of the mo- 
ments of the random variable is equivalent to knowing the behavior 
of the characteristic function near the origin. If the characteristic 
function is sufficiently well behaved for the Taylor series to be valid 
over the entire range of u rather than just in the area around 0, 
then knowing all of the moments of a random variable is sufficient to 
know the transform. Since the transform in turn implies the distribu- 
tion, this guarantees that knowing all of the moments of a random 
variable completely describes the distribution. This is true,
however, only when the distribution is sufficiently “nice,” that is, when
the technical conditions ensuring the existence of all of the required 
derivatives and of the convergence of the Taylor series hold. 

The approximation of the first three terms of (4.14) plays an im- 
portant role in the central limit theorem, so it is worth pointing out 
that it holds under even more general conditions than having an
analytic function. In particular, if $X$ has a second moment so that
$E[X^2] < \infty$, then

\[ M_X(ju) = 1 + ju E(X) - \frac{u^2 E(X^2)}{2} + o(u^2) . \tag{4.15} \]



See, for example, Breiman’s treatment of characteristic functions [7]. 

The most important application of the characteristic function is its 
use in deriving properties of sums of independent random variables, 
as was seen in (3.108).



4.4 Functions of Several Random Variables 

Thus far expectations have been considered for functions of a single 
random variable, but it will often be necessary to treat functions 
of multiple random variables such as sums, products, maxima, and 
minima. For example, given random variables U and V defined on a 
common probability space we might wish to find the expectation of 
$Y = g(U, V)$. The fundamental theorem of expectation has a natural
extension (which is proved in the same way). 



Theorem 4.2 Fundamental Theorem of Expectation for Functions 
of Several Random Variables 

Given random variables $X_0, X_1, \ldots, X_{k-1}$ described by a cdf
$F_{X_0, X_1, \ldots, X_{k-1}}$ and given a measurable function $g : \Re^k \to \Re$,

\begin{align*}
E[g(X_0, \ldots, X_{k-1})]
&= \int g(x_0, \ldots, x_{k-1}) \, dF_{X_0, \ldots, X_{k-1}}(x_0, \ldots, x_{k-1}) \\
&= \begin{cases}
\sum_{x_0, \ldots, x_{k-1}} g(x_0, \ldots, x_{k-1})\, p_{X_0, \ldots, X_{k-1}}(x_0, \ldots, x_{k-1}) \\
\quad \text{or} \\
\int g(x_0, \ldots, x_{k-1})\, f_{X_0, \ldots, X_{k-1}}(x_0, \ldots, x_{k-1}) \, dx_0 \cdots dx_{k-1} .
\end{cases}
\end{align*}



We will consider correlation, covariance, multidimensional charac- 
teristic functions, and differential entropy as examples of the expec- 
tation of several random variables. First, however, we develop some 
simple and important properties of expectation that will be needed. 



4.5 Properties of Expectation 

Expectation possesses several basic properties that will prove useful. 
We now present these properties and prove them for the discrete 
case. The continuous results follow by using integrals in place of 
sums. 



Property 1. If $X$ is a random variable such that $\Pr(X \ge 0) = 1$,
then $EX \ge 0$.

Proof $\Pr(X \ge 0) = 1$ implies that the pmf $p_X(x) = 0$ for $x < 0$. If
$p_X(x)$ is nonzero only for nonnegative $x$, then the sum defining the
expectation contains only terms $x p_X(x) \ge 0$, and hence the sum
which equals $EX$ is nonnegative. Note that property 1 parallels
Axiom 2.1 of probability. That is, the nonnegativity of probability 
measures implies property 1. □ 

Property 2. If $X$ is a random variable such that for some fixed
number $r$, $\Pr(X = r) = 1$, then $EX = r$. Thus the expectation of a
constant equals the constant.

Proof $\Pr(X = r) = 1$ implies that $p_X(r) = 1$. Thus the result follows
from the definition of expectation. Observe that property 2 parallels
Axiom 2.2 of probability. That is, the normalization of the total
probability to 1 leaves the constant unscaled in the result. If total
probability were different from 1, the expectation of a constant as 
defined would be a different, scaled value of the constant. □ 

Property 3. Expectation is linear; that is, given two random
variables $X$ and $Y$ and two real constants $a$ and $b$,

\[ E(aX + bY) = aEX + bEY . \]

Proof Let $g(x, y) = ax + by$, where $a$ and $b$ are constants. In this
case the fundamental theorem of expectation for functions of several
(here two) random variables implies that

\begin{align*}
E[aX + bY] &= \sum_{x,y} (ax + by)\, p_{X,Y}(x, y) \\
&= a \sum_x x \sum_y p_{X,Y}(x, y) + b \sum_y y \sum_x p_{X,Y}(x, y) .
\end{align*}

Using the consistency of marginal and joint pmf’s of (3.13)-(3.14)
this becomes

\begin{align*}
E[aX + bY] &= a \sum_x x\, p_X(x) + b \sum_y y\, p_Y(y) \\
&= aE(X) + bE(Y) . \tag{4.16}
\end{align*}
□

Keep in mind that this result has nothing to do with whether or 
not the random variables are independent. 
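
That linearity does not depend on independence can be confirmed on a small joint pmf. The sketch below is an illustration only: the joint pmf (in which $Y$ is a deterministic function of $X$), the constants `a` and `b`, and the helper `expect` are hypothetical.

```python
from fractions import Fraction

# Hypothetical joint pmf p_{X,Y}(x, y) in which Y is determined by X,
# so X and Y are certainly not independent.
joint = {
    (0, 0): Fraction(1, 2),
    (1, 1): Fraction(1, 4),
    (2, 1): Fraction(1, 4),
}


def expect(joint_pmf, g):
    """E[g(X, Y)] by the fundamental theorem for two random variables."""
    return sum(g(x, y) * p for (x, y), p in joint_pmf.items())


a, b = 3, -2
lhs = expect(joint, lambda x, y: a * x + b * y)  # E[aX + bY]
ex = expect(joint, lambda x, y: x)               # E[X]
ey = expect(joint, lambda x, y: y)               # E[Y]
rhs = a * ex + b * ey                            # aE[X] + bE[Y]
exy = expect(joint, lambda x, y: x * y)          # E[XY]

print(lhs, rhs, exy, ex * ey)
```

The two sides of (4.16) agree exactly, while $E[XY]$ differs from $E[X]E[Y]$, which shows that the variables in this example are dependent.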

The linearity of expectation follows from the additivity of prob- 
ability. That is, the summing out of joint pmf’s to get marginal 
pmf’s in the proof was a direct consequence of Axiom 2.4. The
alert reader will likely have noticed the method behind the presen- 
tation of the properties of expectation — each follows directly from 
the corresponding axiom of probability. Furthermore, using (4.9), 
the converse is true: That is, instead of starting with the axioms 
of probability, suppose we start by using the properties of expecta- 
tion as the axioms of expectation. Then the axioms of probability 
become the derived properties of probability. Thus the first three
axioms of probability and the first three properties of expectation are
dual; one can start with either and get the other. One might suspect 
that to get a useful theory based on expectation, one would require 
a property analogous to Axiom 2.4 of probability, that is, a limiting 
form of expectation property 3. This is, in fact, the case, and the 
fourth basic property of expectation is the countably infinite version 
of property 3. When dealing with expectations, however, the fourth 
property is more often stated as a continuity property. That is, it 
is stated in a form analogous to Axiom 2.4 of probability given in 
equation (2.28). For reference we state the property below without 
proof. 

Property 4. Given an increasing sequence of nonnegative random
variables $X_n$; $n = 0, 1, 2, \ldots$, that is, $X_n \ge X_{n-1}$ for all $n$ (i.e.,
$X_n(\omega) \ge X_{n-1}(\omega)$ for all $\omega \in \Omega$), which converge to a limiting
random variable $X = \lim_{n \to \infty} X_n$, then

\[ E\left[ \lim_{n \to \infty} X_n \right] = \lim_{n \to \infty} E X_n . \]



Thus as with probabilities, one can in certain cases exchange the 
order of limits and expectation. The cases include but are not limited 
to those of property 4. Property 4 is called the monotone conver- 
gence theorem and is one of the basic properties of integration as 
well as expectation. This theorem is discussed in appendix B along 
with another important limiting result, the dominated convergence 
theorem. 
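
Property 4 can be illustrated with a concrete increasing sequence. The sketch below assumes a geometric random variable $X$ on $\{1, 2, \ldots\}$ with parameter $p = 1/2$ (so that $EX = 2$) and the truncations $X_n = \min(X, n)$; this choice of example and the helper names are hypothetical.

```python
from fractions import Fraction

p = Fraction(1, 2)
q = 1 - p


def pmf(k):
    """Geometric pmf on k = 1, 2, ...: p_X(k) = q^(k-1) p."""
    return q ** (k - 1) * p


def expected_truncation(n):
    """E[min(X, n)]: values below n keep their pmf, the rest collapse to n."""
    head = sum(k * pmf(k) for k in range(1, n + 1))
    tail = n * q ** n  # n * Pr(X > n), since Pr(X > n) = q^n
    return head + tail


# X_n = min(X, n) is nonnegative and increases to X as n grows.
values = [expected_truncation(n) for n in range(0, 11)]

print([float(v) for v in values])
```

The expectations $E X_n$ form a nondecreasing sequence that approaches $EX = 2$, as the monotone convergence theorem asserts.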

In fact, the four properties of expectation can be taken as a defi- 
nition of an integral (viz., the Stieltjes integral) and used to develop 
the general Lebesgue theory of integration. That is, the theory of 
expectation is really just a specialization of the theory of integration. 
The duality between probability and expectation is just a special case 
of the duality between measure theory and the theory of integration. 
4.6 Examples

4.6.1 Correlation

We next introduce the idea of correlation or expectation of products
of random variables that will lead to the development of a property 
of expectation that is special to independent random variables. A 
weak form of this property will be seen to provide a weak form of 
independence that will later be useful in characterizing certain ran- 
dom processes. Correlations will later be seen to play a fundamental 
role in many signal processing applications. Suppose we have two in- 
dependent random variables X and Y and we have two functions or 
measurements on these random variables, say g(X) and h(Y), where 
$g : \Re \to \Re$, $h : \Re \to \Re$, and $E[g(X)]$ and $E[h(Y)]$ exist and are finite.
Consider the expected value of the product of these two functions, 
called the correlation between g(X) and h(Y). As we shall consider 
in more detail later, if we are considering complex valued random 
variables or functions, the correlation of g(X) and h(Y) is defined by 
$E[g(X)h(Y)^*]$, where the asterisk denotes the complex conjugate. For
the time being, however, we focus on the simpler case of real-valued
random variables and functions. 

Applying the two-dimensional vector case of the fundamental the- 
orem of expectation to discrete random variables results in 

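\begin{align*}
E(g(X)h(Y)) &= \sum_{x,y} g(x) h(y)\, p_{X,Y}(x, y) \\
&= \sum_x \sum_y g(x) h(y)\, p_X(x)\, p_Y(y) \\
&= \Bigl( \sum_x g(x)\, p_X(x) \Bigr) \Bigl( \sum_y h(y)\, p_Y(y) \Bigr) \\
&= (E(g(X)))(E(h(Y))) .
\end{align*}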
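
The factorization of the double sum can be verified on a small example. In the sketch below the marginal pmf’s, the functions $g$ and $h$, and the variable names are hypothetical; independence is imposed by constructing the joint pmf as the product of the marginals.

```python
from fractions import Fraction

# Hypothetical marginal pmf's of two independent random variables.
p_x = {0: Fraction(1, 4), 1: Fraction(3, 4)}
p_y = {-1: Fraction(1, 3), 2: Fraction(2, 3)}

g = lambda x: x + 1
h = lambda y: y * y

# Independence: the joint pmf is the product of the marginals.
joint = {(x, y): px * py for x, px in p_x.items() for y, py in p_y.items()}

# Left side: expectation of the product, from the joint pmf.
e_prod = sum(g(x) * h(y) * p for (x, y), p in joint.items())

# Right side: product of the two expectations, from the marginals.
e_g = sum(g(x) * p for x, p in p_x.items())
e_h = sum(h(y) * p for y, p in p_y.items())

print(e_prod, e_g * e_h)
```

The equality relies on the product form of the joint pmf; for a joint pmf that does not factor, the two quantities generally differ.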

A similar manipulation with integrals shows the same to be true 
for random variables possessing pdf’s. Thus we have proved the 
following result, which we state formally as a lemma. 




Lemma 4.1 For any two independent random variables X and Y , 

\[ E(g(X)h(Y)) = (Eg(X))(Eh(Y)) \tag{4.17} \]




